The Chronicles of Crafting an HTML Parser
Development Record of ZMarkupParser HTML to NSAttributedString Rendering Engine
This article covers HTML string tokenization, normalization, abstract syntax tree generation, the application of Visitor Pattern / Builder Pattern, and some miscellaneous discussions.
Continuing from the Previous Article
Last year, I published an article titled “[TL;DR] Implementing iOS NSAttributedString HTML Render”, which briefly introduced using XMLParser to parse HTML and map it to the corresponding NSAttributedString.Key attributes. The program architecture and approach in that article were quite disorganized, because it was merely a record of the challenges encountered and I had not invested much time in exploring the topic in depth.
Convert HTML String to NSAttributedString
Revisiting the topic, our goal is to convert HTML strings provided by an API into NSAttributedString and apply corresponding styles to display them in UITextView/UILabel.
For example, <b>Test<a>Link</a></b>
should be displayed as “TestLink”: the whole string in bold, with “Link” also rendered as a link.
Note 1 Using HTML as the communication and rendering medium between the app and data is not recommended. HTML specifications are too flexible, and apps cannot support all HTML styles without an official HTML conversion rendering engine.
Note 2 Starting from iOS 15, you can use the native AttributedString to parse Markdown, or you can introduce the apple/swift-markdown Swift Package to parse Markdown.
Note 3 Due to the large size of our company’s project and the extensive use of HTML as a medium for many years, we are unable to completely switch to Markdown or other markup languages at the moment.
Note 4 The HTML used here is not meant to display entire HTML webpages; it serves only as a Markdown-like way to style strings. (To render entire pages with complex content, including images and tables, WebView with loadHTML is still required.)
It is strongly recommended to use Markdown as the string rendering language. However, if your project faces challenges similar to mine, where you have to use HTML and lack an elegant tool for converting it to NSAttributedString, then the approach in this article may be useful.
For those who remember the previous article, you can directly skip to the section ZhgChgLi / ZMarkupParser.
NSAttributedString.DocumentType.html
The HTML to NSAttributedString approaches found on the internet usually involve directly using NSAttributedString’s built-in options to render HTML, as shown below:
let htmlString = "<b>Test<a>Link</a></b>"
let data = htmlString.data(using: String.Encoding.utf8)!
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey: Any] = [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
]
let attributedString = try! NSAttributedString(data: data, options: attributedOptions, documentAttributes: nil)
Issues with this approach:
- Poor performance: This method renders styles through WebView Core and then switches back to the Main Thread for UI display. Rendering around 300 characters takes 0.03 seconds.
- Content loss: For example, marketing copy containing <Congratulation!> has that text treated as an HTML tag and removed.
- Limited customization: For instance, you cannot specify the boldness level of HTML bold text when converting to NSAttributedString.
- Intermittent crashes on iOS 12 and later, with no official solution.
- Extensive crashes observed in iOS 15, particularly when the device’s battery is low (iOS ≥ 15.2 has fixed this issue).
- Crash when the string is too long; testing showed that inputting a string of length 54,600+ would cause a 100% crash (EXC_BAD_ACCESS).
The most painful issue is undoubtedly the crashing problem. Since iOS 15 was released until version 15.2 with the fix, this problem has consistently plagued the app. According to the data, between 2022/03/11 and 2022/06/08, there were more than 2.4K crashes, impacting over 1.4K users.
The second problem is performance. As HTML is used as a markup language for string styles, it is heavily applied to UILabel/UITextView in the app. As mentioned earlier, rendering one label takes 0.03 seconds, and when multiplied across many UILabel/UITextView instances, it leads to noticeable lag in user interaction.
XMLParser
The second approach is the one introduced in the previous article, which involves using XMLParser to parse HTML and apply the corresponding NSAttributedString Key to implement the styles.
You can refer to the implementation in SwiftRichString and the content covered in the previous article.
The previous article only explored the possibility of using XMLParser to parse HTML and perform corresponding conversions. While an experimental implementation was completed, it was not designed as a well-structured “tool” with extendability.
Issues with this approach:
- Zero fault tolerance: <br>, <Congratulation!>, and <b>Bold<i>Bold+Italic</b>Italic</i> each make XMLParser throw an error, and the result is displayed as blank.
- With XMLParser, HTML strings must fully comply with XML rules; they cannot be displayed with fault tolerance the way a browser or NSAttributedString.DocumentType.html does.
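The zero-tolerance behavior can be reproduced with a few lines. The following is a minimal sketch, not code from the previous article: the helper name isWellFormedXML is hypothetical, and it only reports whether Foundation's XMLParser accepts a string.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML // XMLParser lives in FoundationXML on Linux
#endif

// Hypothetical helper: returns true only if XMLParser can parse the
// whole string without raising a parse error.
func isWellFormedXML(_ string: String) -> Bool {
    let parser = XMLParser(data: Data(string.utf8))
    return parser.parse()
}

// A properly nested fragment is accepted.
print(isWellFormedXML("<b>Test<a>Link</a></b>"))
// Misplaced close tags are rejected outright instead of being repaired.
print(isWellFormedXML("<b>Bold<i>Bold+Italic</b>Italic</i>"))
```

Because the parser stops at the first error, there is no partial result to fall back on, which is why the label ends up blank.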
Standing on the Shoulders of Giants
Neither of the two solutions can perfectly and elegantly solve the HTML issues, so I started searching for existing solutions.
- johnxnguyen / Down Supports converting Markdown to Any (XML/NSAttributedString…), but does not support converting HTML.
- malcommac / SwiftRichString Uses XMLParser as the underlying mechanism, but testing showed the same fault tolerance rate of 0 issues.
- scinfu / SwiftSoup Supports HTML Parser (Selector) but does not support conversion to NSAttributedString.
After an extensive search, it seems that all the results are similar to the projects mentioned above, Orz, there’s no giant’s shoulder to stand on.
ZhgChgLi/ZMarkupParser
With no giants to rely on, I had to become the giant myself and developed the HTML String to NSAttributedString tool.
Developed purely in Swift, it uses Regex to extract HTML tags, tokenizes them, and then checks and repairs the tag structure (fixing unclosed tags and misplaced tags). It then converts the parsed data into an abstract syntax tree and uses the Visitor Pattern to map HTML tags to abstract styles, resulting in the final NSAttributedString. The tool does not rely on any external parser library.
Features
- Supports HTML Render (to NSAttributedString) / Stripper (removing HTML tags) / Selector functionalities.
- Higher performance compared to NSAttributedString.DocumentType.html.
- Automatically checks and repairs tags (fixing unclosed tags and misplaced tags).
- Supports dynamic styling from style="color:red…".
- Supports custom style specifications, for example, requiring extra boldness.
- Offers flexibility for extending or customizing tags and attributes.
For detailed information on installation and usage, please refer to the article: “ZMarkupParser HTML String to NSAttributedString Tool.”
To try it out directly, you can git clone the project, open the ZMarkupParser.xcworkspace project, select the ZMarkupParser-Demo target, and build & run the project.
Technical Details
Now let’s get to the technical details behind the development of this tool.
Overview of the Process
The diagram above gives a rough overview of the process. The following sections explain each step in detail with accompanying code.
⚠️️️️️️ This article will simplify the demo code and reduce abstractions and performance considerations, focusing on explaining the working principles. For the final implementation, please refer to the Source Code of the project.
Code-Based Tokenization
When it comes to HTML rendering, the most crucial step is parsing. In the past, HTML was parsed using XMLParser as if it were XML. However, this approach couldn’t handle the fact that HTML, in everyday use, is not always 100% XML-compliant, leading to parsing errors and an inability to dynamically correct them.
After ruling out the XMLParser approach, the only option left for us in Swift was to use regular expressions (Regex) for matching and parsing.
Initially, I didn’t delve too deep and thought I could directly use regular expressions to extract “paired” HTML tags, then recursively search for HTML tags inside them until the process is complete. However, this method couldn’t handle nested HTML tags or support misaligned, error-tolerant situations. Therefore, I changed the strategy to extract “individual” HTML tags and record whether they are Start Tags, Close Tags, or Self-Closing Tags, along with other string combinations, forming an array of parsing results.
The structure of Tokenization is as follows:
enum HTMLParsedResult {
case start(StartItem) // <a>
case close(CloseItem) // </a>
case selfClosing(SelfClosingItem) // <br/>
case rawString(NSAttributedString)
}
extension HTMLParsedResult {
class SelfClosingItem {
let tagName: String
let tagAttributedString: NSAttributedString
let attributes: [String: String]?
init(tagName: String, tagAttributedString: NSAttributedString, attributes: [String : String]?) {
self.tagName = tagName
self.tagAttributedString = tagAttributedString
self.attributes = attributes
}
}
class StartItem {
let tagName: String
let tagAttributedString: NSAttributedString
let attributes: [String: String]?
// The Start Tag could be an exceptional HTML Tag or just normal text, e.g., <Congratulation!>. After subsequent normalization, if it is an isolated Start Tag, it will be marked as True.
var isIsolated: Bool = false
init(tagName: String, tagAttributedString: NSAttributedString, attributes: [String : String]?) {
self.tagName = tagName
self.tagAttributedString = tagAttributedString
self.attributes = attributes
}
// Used for automatic padding correction during subsequent normalization
func convertToCloseParsedItem() -> CloseItem {
return CloseItem(tagName: self.tagName)
}
// Used for automatic padding correction during subsequent normalization
func convertToSelfClosingParsedItem() -> SelfClosingItem {
return SelfClosingItem(tagName: self.tagName, tagAttributedString: self.tagAttributedString, attributes: self.attributes)
}
}
class CloseItem {
let tagName: String
init(tagName: String) {
self.tagName = tagName
}
}
}
The regular expression used is as follows:
<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)
- closeTag: Matches the / in </a>
- tagName: Matches the a in <a> or </a>
- tagAttributes: Matches the attributes in <a href="https://zhgchg.li" style="color:red" >
- selfClosingTag: Matches the / in <br/>
*Note: This regex can still be optimized, which we can address in the future.
The latter part of the article provides additional information about the regex for those interested in delving deeper.
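To see what the named groups capture, here is a small sketch that runs the pattern over a sample string with NSRegularExpression. The TagMatch struct and the matchTags function are hypothetical names for illustration only; the pattern is the one above, with the outer non-capturing group closed.

```swift
import Foundation

// Hypothetical value type describing one matched tag.
struct TagMatch: Equatable {
    let name: String
    let isClose: Bool
    let isSelfClosing: Bool
    let attributes: String
}

func matchTags(in html: String) -> [TagMatch] {
    let pattern = #"<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let source = html as NSString

    // A named group that did not take part in the match reports NSNotFound.
    func capture(_ name: String, in match: NSTextCheckingResult) -> String? {
        let range = match.range(withName: name)
        return range.location == NSNotFound ? nil : source.substring(with: range)
    }

    let fullRange = NSRange(location: 0, length: source.length)
    return regex.matches(in: html, range: fullRange).compactMap { match -> TagMatch? in
        guard let name = capture("tagName", in: match) else { return nil }
        return TagMatch(
            name: name.lowercased(),
            isClose: capture("closeTag", in: match) == "/",
            isSelfClosing: capture("selfClosingTag", in: match) == "/",
            attributes: (capture("tagAttributes", in: match) ?? "")
                .trimmingCharacters(in: .whitespacesAndNewlines)
        )
    }
}

let tags = matchTags(in: "<a href=\"https://zhgchg.li\">Li<b>nk</a>Bold</b><br/>")
for tag in tags {
    print(tag.name, tag.isClose, tag.isSelfClosing, tag.attributes)
}
```

Each tag is matched on its own, without looking for its partner, which is what allows misplaced or unclosed tags to be handled in a later step.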
The combined code is as follows:
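var tokenizationResult: [HTMLParsedResult] = []
// The regular expression shown above
let pattern = #"<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)"#
let expression = try! NSRegularExpression(pattern: pattern, options: [])
let attributedString = NSAttributedString(string: "<a href=\"https://zhgchg.li\">Li<b>nk</a>Bold</b>")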
let totalLength = attributedString.string.utf16.count // utf-16 support emoji
var lastMatch: NSTextCheckingResult?
// Start Tags Stack, first in, last out (FILO)
// Check if the HTML string requires subsequent normalization for fixing misplacements or completing Self-Closing Tags
var stackStartItems: [HTMLParsedResult.StartItem] = []
var needForFormatter: Bool = false
expression.enumerateMatches(in: attributedString.string, range: NSMakeRange(0, totalLength)) { match, _, _ in
if let match = match {
// Check the string between tags or to the first tag, e.g., "Test<a>Link</a>zzz<b>bold</b>Test2" -> "Test,zzz"
let lastMatchEnd = lastMatch?.range.upperBound ?? 0
let currentMatchStart = match.range.lowerBound
if currentMatchStart > lastMatchEnd {
let rawStringBetweenTag = attributedString.attributedSubstring(from: NSMakeRange(lastMatchEnd, (currentMatchStart - lastMatchEnd)))
tokenizationResult.append(.rawString(rawStringBetweenTag))
}
// <a href="https://zhgchg.li">, </a>
let matchAttributedString = attributedString.attributedSubstring(from: match.range)
// a, a
let matchTag = attributedString.attributedSubstring(from: match.range(withName: "tagName"))?.string.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
// false, true
let matchIsEndTag = matchResult.attributedString(from: match.range(withName: "closeTag"))?.string.trimmingCharacters(in: .whitespacesAndNewlines) == "/"
// href="https://zhgchg.li", nil
// Use regex to extract HTML attributes into [String: String], please refer to the Source Code for details
let matchTagAttributes = parseAttributes(matchResult.attributedString(from: match.range(withName: "tagAttributes")))
// false, false
let matchIsSelfClosingTag = matchResult.attributedString(from: match.range(withName: "selfClosingTag"))?.string.trimmingCharacters(in: .whitespacesAndNewlines) == "/"
if let matchAttributedString = matchAttributedString,
let matchTag = matchTag {
if matchIsSelfClosingTag {
// e.g. <br/>
tokenizationResult.append(.selfClosing(.init(tagName: matchTag, tagAttributedString: matchAttributedString, attributes: matchTagAttributes)))
} else {
// e.g. <a> or </a>
if matchIsEndTag {
// e.g. </a>
// Retrieve the position of the corresponding Start Tag from the Stack, starting from the last occurrence
if let index = stackStartItems.lastIndex(where: { $0.tagName == matchTag }) {
// If it's not the last one, it means there are misplacements or missing closing Tags
if index != stackStartItems.count - 1 {
needForFormatter = true
}
tokenizationResult.append(.close(.init(tagName: matchTag)))
stackStartItems.remove(at: index)
} else {
// Redundant close tag, e.g., </a>
// It doesn't affect subsequent steps, so we ignore it
}
} else {
// e.g. <a>
let startItem: HTMLParsedResult.StartItem = HTMLParsedResult.StartItem(tagName: matchTag, tagAttributedString: matchAttributedString, attributes: matchTagAttributes)
tokenizationResult.append(.start(startItem))
// Push it onto the Stack
stackStartItems.append(startItem)
}
}
}
lastMatch = match
}
}
// Check the ending RawString, e.g., "Test<a>Link</a>Test2" -> "Test2"
if let lastMatch = lastMatch {
let currentIndex = lastMatch.range.upperBound
if totalLength > currentIndex {
// There are remaining characters
let resetString = attributedString.attributedSubstring(from: NSMakeRange(currentIndex, (totalLength - currentIndex)))
tokenizationResult.append(.rawString(resetString))
}
} else {
// lastMatch = nil, meaning no tags were found, and it's all plain text
let resetString = attributedString.attributedSubstring(from: NSMakeRange(0, totalLength))
tokenizationResult.append(.rawString(resetString))
}
// Check if the Stack is empty, if not, it means there are Start Tags without corresponding End Tags
// Mark them as isolated Start Tags
for stackStartItem in stackStartItems {
stackStartItem.isIsolated = true
needForFormatter = true
}
print(tokenizationResult)
// [
// .start("a",["href":"https://zhgchg.li"])
// .rawString("Li")
// .start("b",nil)
// .rawString("nk")
// .close("a")
// .rawString("Bold")
// .close("b")
// ]
Operation process as shown in the diagram above.
In the end, you will get a Tokenization result array.
Corresponding implementation in the source code: HTMLStringToParsedResultProcessor.swift
Standardization — Normalization
Also known as Formatter, normalization.
After the previous step produces the preliminary parsing result, if tokenization found that correction is needed (needForFormatter is true), this step automatically repairs the HTML tag problems.
There are three types of HTML tag issues:
- An HTML tag with a missing close tag, for example, <br>
- Regular text being treated as an HTML tag, for example, <Congratulation!>
- Misplaced HTML tags, for example, <a>Li<b>nk</a>Bold</b>
The correction process is straightforward; we need to iterate through the elements of the Tokenization result and attempt to fill in the missing parts.
Operation process as shown in the diagram above
var normalizationResult = tokenizationResult
// Start Tags Stack, First In Last Out (FILO)
var stackExpectedStartItems: [HTMLParsedResult.StartItem] = []
var itemIndex = 0
while itemIndex < normalizationResult.count {
switch normalizationResult[itemIndex] {
case .start(let item):
if item.isIsolated {
// If it is an isolated Start Tag
if WC3HTMLTagName(rawValue: item.tagName) == nil && (item.attributes?.isEmpty ?? true) {
// If it is not a W3C-defined HTML Tag and has no HTML Attribute
// Treat it as regular raw string type
normalizationResult[itemIndex] = .rawString(item.tagAttributedString)
} else {
// Otherwise, convert it to a self-closing tag, e.g., <br> -> <br/>
normalizationResult[itemIndex] = .selfClosing(item.convertToSelfClosingParsedItem())
}
itemIndex += 1
} else {
// Normal Start Tag, add to the Stack
stackExpectedStartItems.append(item)
itemIndex += 1
}
case .close(let item):
// Encountered a Close Tag
// Get the tags between the Start Stack Tag and this Close Tag
// e.g., <a><u><b>[CurrentIndex]</b></u></a> -> Interval 0
// e.g., <a><u><b>[CurrentIndex]</a></u></b> -> Interval b,u
let reversedStackExpectedStartItems = Array(stackExpectedStartItems.reversed())
guard let reversedStackExpectedStartItemsOccurredIndex = reversedStackExpectedStartItems.firstIndex(where: { $0.tagName == item.tagName }) else {
itemIndex += 1
continue
}
let reversedStackExpectedStartItemsOccurred = Array(reversedStackExpectedStartItems.prefix(upTo: reversedStackExpectedStartItemsOccurredIndex))
// Interval 0 means the tags are not misplaced
guard reversedStackExpectedStartItemsOccurred.count != 0 else {
// If it is a pair, pop it
stackExpectedStartItems.removeLast()
itemIndex += 1
continue
}
// If there are other intervals, automatically fill in the missing tags in between
// e.g., <a><u><b>[CurrentIndex]</a></u></b> ->
// e.g., <a><u><b>[CurrentIndex]</b></u></a><u><b></u></b>
let stackExpectedStartItemsOccurred = Array(reversedStackExpectedStartItemsOccurred.reversed())
let afterItems = stackExpectedStartItemsOccurred.map({ HTMLParsedResult.start($0) })
let beforeItems = reversedStackExpectedStartItemsOccurred.map({ HTMLParsedResult.close($0.convertToCloseParsedItem()) })
normalizationResult.insert(contentsOf: afterItems, at: normalizationResult.index(after: itemIndex))
normalizationResult.insert(contentsOf: beforeItems, at: itemIndex)
itemIndex = normalizationResult.index(after: itemIndex) + stackExpectedStartItemsOccurred.count
// Update Start Stack Tags
// e.g., -> b,u
stackExpectedStartItems.removeAll { startItem in
return reversedStackExpectedStartItems.prefix(through: reversedStackExpectedStartItemsOccurredIndex).contains(where: { $0 === startItem })
}
case .selfClosing, .rawString:
itemIndex += 1
}
}
print(normalizationResult)
// [
// .start("a",["href":"https://zhgchg.li"])
// .rawString("Li")
// .start("b",nil)
// .rawString("nk")
// .close("b")
// .close("a")
// .start("b",nil)
// .rawString("Bold")
// .close("b")
// ]
Corresponding implementation in the source code: HTMLParsedResultFormatterProcessor.swift
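The core idea of repairing misplaced tags can also be shown in a stripped-down, self-contained form. The sketch below is an assumption-laden simplification rather than the project's implementation: it works on a hypothetical SimpleToken enum that holds plain strings, builds a new output array instead of inserting in place, and does not handle isolated Start Tags.

```swift
// Hypothetical simplified token type (no attributes, no attributed strings).
enum SimpleToken: Equatable {
    case start(String)
    case close(String)
    case text(String)
}

func normalize(_ tokens: [SimpleToken]) -> [SimpleToken] {
    var output: [SimpleToken] = []
    var openTags: [String] = [] // stack of currently open tag names

    for token in tokens {
        switch token {
        case .start(let name):
            openTags.append(name)
            output.append(token)
        case .text:
            output.append(token)
        case .close(let name):
            // A close tag with no matching start tag is redundant; drop it.
            guard let index = openTags.lastIndex(of: name) else { continue }
            // Tags opened after the matching start tag are still open,
            // so this close tag interrupts them.
            let interrupted = Array(openTags[(index + 1)...])
            // Close the interrupted tags innermost first, then the tag itself.
            for tag in interrupted.reversed() { output.append(.close(tag)) }
            output.append(.close(name))
            // Reopen the interrupted tags so the following content keeps their style.
            for tag in interrupted { output.append(.start(tag)) }
            openTags.removeSubrange(index...)
            openTags.append(contentsOf: interrupted)
        }
    }
    return output
}

// <a>Li<b>nk</a>Bold</b>
let repaired = normalize([
    .start("a"), .text("Li"), .start("b"), .text("nk"),
    .close("a"), .text("Bold"), .close("b")
])
print(repaired)
```

For the sample input this yields the same tag order as the normalization result printed above: the b tag is closed before a, then reopened for "Bold".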
Abstract Syntax Tree
AKA AST, or Abstract Tree.
After completing Tokenization & Normalization data preprocessing, the next step is to transform the result into an abstract syntax tree 🌲.
As shown above.
Converting it into an abstract syntax tree allows us to perform future operations and extensions more conveniently. For example, implementing the Selector feature or performing other transformations, such as HTML to Markdown. Additionally, if we want to add Markdown to NSAttributedString in the future, we only need to implement Markdown’s Tokenization & Normalization to achieve it.
First, we define a Markup Protocol with Child & Parent properties to record information about leaves and branches:
protocol Markup: AnyObject {
var parentMarkup: Markup? { get set }
var childMarkups: [Markup] { get set }
func appendChild(markup: Markup)
func prependChild(markup: Markup)
func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result
}
extension Markup {
func appendChild(markup: Markup) {
markup.parentMarkup = self
childMarkups.append(markup)
}
func prependChild(markup: Markup) {
markup.parentMarkup = self
childMarkups.insert(markup, at: 0)
}
}
In addition, we use the Visitor Pattern to define each style attribute as a Markup Element, and then obtain individual application results through different Visit strategies.
protocol MarkupVisitor {
associatedtype Result
func visit(markup: Markup) -> Result
func visit(_ markup: RootMarkup) -> Result
func visit(_ markup: RawStringMarkup) -> Result
func visit(_ markup: BoldMarkup) -> Result
func visit(_ markup: LinkMarkup) -> Result
//...
}
extension MarkupVisitor {
func visit(markup: Markup) -> Result {
return markup.accept(self)
}
}
Basic Markup nodes:
// Root node
final class RootMarkup: Markup {
weak var parentMarkup: Markup? = nil
var childMarkups: [Markup] = []
func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
return visitor.visit(self)
}
}
// Leaf node
final class RawStringMarkup: Markup {
let attributedString: NSAttributedString
init(attributedString: NSAttributedString) {
self.attributedString = attributedString
}
weak var parentMarkup: Markup? = nil
var childMarkups: [Markup] = []
func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
return visitor.visit(self)
}
}
Definition of Markup style nodes:
// Branch nodes:
// Link style
final class LinkMarkup: Markup {
weak var parentMarkup: Markup? = nil
var childMarkups: [Markup] = []
func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
return visitor.visit(self)
}
}
// Bold style
final class BoldMarkup: Markup {
weak var parentMarkup: Markup? = nil
var childMarkups: [Markup] = []
func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
return visitor.visit(self)
}
}
Corresponding to the Markup implementation in the source code.
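To illustrate how a visitor produces a result from such a tree, here is a self-contained miniature. All names (MiniNode, MiniVisitor, PlainTextVisitor, and so on) are hypothetical and chosen so they do not clash with the types above; the visitor simply concatenates the raw text, roughly what a tag stripper would need.

```swift
protocol MiniVisitor {
    associatedtype Result
    func visit(_ node: MiniRoot) -> Result
    func visit(_ node: MiniText) -> Result
    func visit(_ node: MiniBold) -> Result
}

protocol MiniNode: AnyObject {
    var children: [MiniNode] { get set }
    func accept<V: MiniVisitor>(_ visitor: V) -> V.Result
}

final class MiniRoot: MiniNode {
    var children: [MiniNode] = []
    func accept<V: MiniVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class MiniText: MiniNode {
    let text: String
    var children: [MiniNode] = []
    init(_ text: String) { self.text = text }
    func accept<V: MiniVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class MiniBold: MiniNode {
    var children: [MiniNode] = []
    func accept<V: MiniVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

// One visiting strategy: ignore styles and collect the text of all leaves.
struct PlainTextVisitor: MiniVisitor {
    typealias Result = String
    func visit(_ node: MiniRoot) -> String { join(node.children) }
    func visit(_ node: MiniText) -> String { node.text }
    func visit(_ node: MiniBold) -> String { join(node.children) }

    private func join(_ nodes: [MiniNode]) -> String {
        nodes.map { $0.accept(self) }.joined()
    }
}

// Tree for <b>Test</b>Link
let miniRoot = MiniRoot()
let miniBold = MiniBold()
miniBold.children.append(MiniText("Test"))
miniRoot.children = [miniBold, MiniText("Link")]
print(miniRoot.accept(PlainTextVisitor()))
```

A second visitor with Result set to an attributed string type could walk the same tree without any change to the node classes, which is the main reason for choosing this pattern.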
Before converting it into an abstract syntax tree, we still need…
MarkupComponent
Because the tree nodes themselves carry no data (for example, a LinkMarkup node needs URL information before it can be rendered), we define a separate container that pairs a tree node with its related data:
protocol MarkupComponent {
associatedtype T
var markup: Markup { get }
var value: T { get }
init(markup: Markup, value: T)
}
extension Sequence where Iterator.Element: MarkupComponent {
func value(markup: Markup) -> Element.T? {
return self.first(where:{ $0.markup === markup })?.value as? Element.T
}
}
Corresponding to the MarkupComponent implementation in the source code.
Alternatively, Markup can be declared as Hashable, and we can directly use a Dictionary to store values [Markup: Any]. However, in this case, Markup cannot be used as a regular type and requires adding any Markup.
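A related option, shown here only as a sketch, is to key a dictionary by ObjectIdentifier instead of making the node type Hashable. This is a different technique from the one described above, and NodeValueStore, LinkNode, and BoldNode are hypothetical names. Note that an ObjectIdentifier is only meaningful while the object is alive, so the nodes must outlive the store.

```swift
final class LinkNode {}
final class BoldNode {}

// Stores one value per node instance, looked up by object identity.
struct NodeValueStore<Value> {
    private var storage: [ObjectIdentifier: Value] = [:]

    init() {}

    mutating func set(_ value: Value, for node: AnyObject) {
        storage[ObjectIdentifier(node)] = value
    }

    func value(for node: AnyObject) -> Value? {
        storage[ObjectIdentifier(node)]
    }
}

let linkNode = LinkNode()
let boldNode = BoldNode()
var attributeStore = NodeValueStore<[String: String]>()
attributeStore.set(["href": "https://zhgchg.li"], for: linkNode)

print(attributeStore.value(for: linkNode) ?? [:])
print(attributeStore.value(for: boldNode) ?? [:])
```

Compared with scanning an array for a matching node, the lookup is a hash access, at the cost of having to keep the store and the tree lifetimes in sync.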
HTMLTag & HTMLTagName & HTMLTagNameVisitor
We have also abstracted the HTML Tag Name part, allowing users to decide which tags need to be processed and facilitating future extensions. For example, the <strong> tag name can correspond to BoldMarkup.
public protocol HTMLTagName {
var string: String { get }
func accept<V: HTMLTagNameVisitor>(_ visitor: V) -> V.Result
}
public struct A_HTMLTagName: HTMLTagName {
public let string: String = WC3HTMLTagName.a.rawValue
public init() {
}
public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
return visitor.visit(self)
}
}
public struct B_HTMLTagName: HTMLTagName {
public let string: String = WC3HTMLTagName.b.rawValue
public init() {
}
public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
return visitor.visit(self)
}
}
Corresponding to the HTMLTagNameVisitor implementation in the source code.
Additionally, an enum of HTML tag names was compiled with reference to the W3C wiki: WC3HTMLTagName.swift
HTMLTag is simply a container object because we want to allow external specification of the style corresponding to HTML tags. So, we declare a container to put them together:
struct HTMLTag {
let tagName: HTMLTagName
let customStyle: MarkupStyle? // We'll explain Render later.
init(tagName: HTMLTagName, customStyle: MarkupStyle? = nil) {
self.tagName = tagName
self.customStyle = customStyle
}
}
Corresponds to the implementation of HTMLTag in the source code.
HTMLTagNameToHTMLMarkupVisitor
struct HTMLTagNameToMarkupVisitor: HTMLTagNameVisitor {
typealias Result = Markup
let attributes: [String: String]?
func visit(_ tagName: A_HTMLTagName) -> Result {
return LinkMarkup()
}
func visit(_ tagName: B_HTMLTagName) -> Result {
return BoldMarkup()
}
//...
}
Corresponds to the implementation of HTMLTagNameToHTMLMarkupVisitor in the source code.
Converting to Abstract Syntax Tree with HTML Data
We need to convert the normalized HTML data result into an abstract syntax tree. First, let’s declare a data structure, MarkupComponent, that can hold HTML data:
struct HTMLElementMarkupComponent: MarkupComponent {
struct HTMLElement {
let tag: HTMLTag
let tagAttributedString: NSAttributedString
let attributes: [String: String]?
}
typealias T = HTMLElement
let markup: Markup
let value: HTMLElement
init(markup: Markup, value: HTMLElement) {
self.markup = markup
self.value = value
}
}
Converting to Markup Abstract Syntax Tree:
var htmlElementComponents: [HTMLElementMarkupComponent] = []
let rootMarkup = RootMarkup()
var currentMarkup: Markup = rootMarkup
let htmlTags: [String: HTMLTag]
init(htmlTags: [HTMLTag]) {
self.htmlTags = Dictionary(uniqueKeysWithValues: htmlTags.map{ ($0.tagName.string, $0) })
}
// Start Tags Stack, ensuring correct popping of tags
// Normalization has been done earlier, so it should not result in errors, just to be sure
var stackExpectedStartItems: [HTMLParsedResult.StartItem] = []
for thisItem in from {
switch thisItem {
case .start(let item):
let visitor = HTMLTagNameToMarkupVisitor(attributes: item.attributes)
let htmlTag = self.htmlTags[item.tagName] ?? HTMLTag(tagName: ExtendTagName(item.tagName))
// Using the Visitor to determine the corresponding Markup
let markup = visitor.visit(tagName: htmlTag.tagName)
// Adding oneself as a leaf node of the current branch
// Becoming the current branch
htmlElementComponents.append(.init(markup: markup, value: .init(tag: htmlTag, tagAttributedString: item.tagAttributedString, attributes: item.attributes)))
currentMarkup.appendChild(markup: markup)
currentMarkup = markup
stackExpectedStartItems.append(item)
case .selfClosing(let item):
// Adding directly as a leaf node of the current branch
let visitor = HTMLTagNameToMarkupVisitor(attributes: item.attributes)
let htmlTag = self.htmlTags[item.tagName] ?? HTMLTag(tagName: ExtendTagName(item.tagName))
let markup = visitor.visit(tagName: htmlTag.tagName)
htmlElementComponents.append(.init(markup: markup, value: .init(tag: htmlTag, tagAttributedString: item.tagAttributedString, attributes: item.attributes)))
currentMarkup.appendChild(markup: markup)
case .close(let item):
if let lastTagName = stackExpectedStartItems.popLast()?.tagName,
lastTagName == item.tagName {
// When encountering a Close Tag, go back to the previous level
currentMarkup = currentMarkup.parentMarkup ?? currentMarkup
}
case .rawString(let attributedString):
// Adding directly as a leaf node of the current branch
currentMarkup.appendChild(markup: RawStringMarkup(attributedString: attributedString))
}
}
// print(htmlElementComponents)
// [(markup: LinkMarkup, (tag: a, attributes: ["href":"zhgchg.li"]...)]
The operation result is shown in the above image.
Corresponds to the implementation of HTMLParsedResultToHTMLElementWithRootMarkupProcessor.swift in the source code.
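The tree-building loop can be tried in isolation with a reduced model. The sketch below assumes already normalized input and uses hypothetical types (TreeToken, TreeNode) that keep only a label per node; it is meant to show how the "current branch" pointer moves, not to mirror the project's types.

```swift
// Hypothetical simplified token type for normalized input.
enum TreeToken {
    case start(String)
    case close(String)
    case text(String)
}

final class TreeNode {
    let label: String
    weak var parent: TreeNode?
    var children: [TreeNode] = []

    init(_ label: String) { self.label = label }

    func append(_ child: TreeNode) {
        child.parent = self
        children.append(child)
    }

    // Compact textual form, e.g. "a(Li,b(nk))".
    var description: String {
        children.isEmpty
            ? label
            : "\(label)(\(children.map { $0.description }.joined(separator: ",")))"
    }
}

func buildTree(from tokens: [TreeToken]) -> TreeNode {
    let root = TreeNode("root")
    var current = root
    for token in tokens {
        switch token {
        case .start(let name):
            // A start tag becomes a child of the current branch,
            // then becomes the current branch itself.
            let node = TreeNode(name)
            current.append(node)
            current = node
        case .close:
            // A close tag moves back up one level.
            current = current.parent ?? current
        case .text(let string):
            // Text is always a leaf of the current branch.
            current.append(TreeNode(string))
        }
    }
    return root
}

// Normalized form of <a>Li<b>nk</a>Bold</b>
let tree = buildTree(from: [
    .start("a"), .text("Li"), .start("b"), .text("nk"), .close("b"), .close("a"),
    .start("b"), .text("Bold"), .close("b")
])
print(tree.description)
```

Because normalization guarantees properly nested tags, the close case does not need to check the tag name here; the full implementation keeps a stack as an extra safeguard.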
At this point, we have actually completed the functionality of Selector 🎉
public class HTMLSelector: CustomStringConvertible {
let markup: Markup
let components: [HTMLElementMarkupComponent]
init(markup: Markup, components: [HTMLElementMarkupComponent]) {
self.markup = markup
self.components = components
}
public func filter(_ htmlTagName: String) -> [HTMLSelector] {
let result = markup.childMarkups.filter({ components.value(markup: $0)?.tag.tagName.isEqualTo(htmlTagName) ?? false })
return result.map({ .init(markup: $0, components: components) })
}
//...
}
We can filter leaf node objects layer by layer.
Corresponds to the implementation of HTMLSelector in the source code.
Parser — HTML to MarkupStyle (Abstract of NSAttributedString.Key)
Next, we need to complete the process of converting HTML to MarkupStyle (NSAttributedString.Key).
NSAttributedString sets the style of the text using NSAttributedString.Key Attributes. We have abstracted all the fields of NSAttributedString.Key to correspond to MarkupStyle, MarkupStyleColor, MarkupStyleFont, and MarkupStyleParagraphStyle.
Purpose:
- Originally, the data structure of Attributes is [NSAttributedString.Key: Any?]. If exposed directly, it is hard to control the values users pass in, and an incorrect value such as .font: 123 could cause a crash.
- Styles need to be inheritable. For example, in <a><b>test</b></a> the text “test” inherits both the link style and the bold style (bold + link). If the Dictionary were exposed directly, inheritance rules would be hard to control.
- Encapsulate objects belonging to iOS/macOS (UIKit/AppKit).
MarkupStyle Struct
public struct MarkupStyle {
public var font: MarkupStyleFont
public var paragraphStyle: MarkupStyleParagraphStyle
public var foregroundColor: MarkupStyleColor? = nil
public var backgroundColor: MarkupStyleColor? = nil
public var ligature: NSNumber? = nil
public var kern: NSNumber? = nil
public var tracking: NSNumber? = nil
public var strikethroughStyle: NSUnderlineStyle? = nil
public var underlineStyle: NSUnderlineStyle? = nil
public var strokeColor: MarkupStyleColor? = nil
public var strokeWidth: NSNumber? = nil
public var shadow: NSShadow? = nil
public var textEffect: String? = nil
public var attachment: NSTextAttachment? = nil
public var link: URL? = nil
public var baselineOffset: NSNumber? = nil
public var underlineColor: MarkupStyleColor? = nil
public var strikethroughColor: MarkupStyleColor? = nil
public var obliqueness: NSNumber? = nil
public var expansion: NSNumber? = nil
public var writingDirection: NSNumber? = nil
public var verticalGlyphForm: NSNumber? = nil
//...
// Inherited from...
// Default: When a field is nil, it is filled with data from the "from" MarkupStyle object.
mutating func fillIfNil(from: MarkupStyle?) {
guard let from = from else { return }
var currentFont = self.font
currentFont.fillIfNil(from: from.font)
self.font = currentFont
var currentParagraphStyle = self.paragraphStyle
currentParagraphStyle.fillIfNil(from: from.paragraphStyle)
self.paragraphStyle = currentParagraphStyle
//...
}
// Convert MarkupStyle to NSAttributedString.Key: Any
func render() -> [NSAttributedString.Key: Any] {
var data: [NSAttributedString.Key: Any] = [:]
if let font = font.getFont() {
data[.font] = font
}
if let ligature = self.ligature {
data[.ligature] = ligature
}
//...
return data
}
}
public struct MarkupStyleFont: MarkupStyleItem {
public enum FontWeight {
case style(FontWeightStyle)
case rawValue(CGFloat)
}
public enum FontWeightStyle: String {
case ultraLight, light, thin, regular, medium, semibold, bold, heavy, black
// ...
}
public var size: CGFloat?
public var weight: FontWeight?
public var italic: Bool?
//...
}
public struct MarkupStyleParagraphStyle: MarkupStyleItem {
public var lineSpacing: CGFloat? = nil
public var paragraphSpacing: CGFloat? = nil
public var alignment: NSTextAlignment? = nil
public var headIndent: CGFloat? = nil
public var tailIndent: CGFloat? = nil
public var firstLineHeadIndent: CGFloat? = nil
public var minimumLineHeight: CGFloat? = nil
public var maximumLineHeight: CGFloat? = nil
public var lineBreakMode: NSLineBreakMode? = nil
public var baseWritingDirection: NSWritingDirection? = nil
public var lineHeightMultiple: CGFloat? = nil
public var paragraphSpacingBefore: CGFloat? = nil
public var hyphenationFactor: Float? = nil
public var usesDefaultHyphenation: Bool? = nil
public var tabStops: [NSTextTab]? = nil
public var defaultTabInterval: CGFloat? = nil
public var textLists: [NSTextList]? = nil
public var allowsDefaultTighteningForTruncation: Bool? = nil
public var lineBreakStrategy: NSParagraphStyle.LineBreakStrategy? = nil
//...
}
public struct MarkupStyleColor {
let red: Int
let green: Int
let blue: Int
let alpha: CGFloat
//...
}
This corresponds to the implementation of MarkupStyle in the source code.
Additionally, we referred to the W3C wiki, which enumerates the browsers' predefined color names together with their color text and R, G, B values: MarkupStyleColorName.swift.
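To make the inheritance rule concrete, here is a minimal sketch of the fillIfNil idea. SketchFont and SketchStyle are hypothetical, heavily reduced stand-ins for MarkupStyleFont and MarkupStyle (plain strings replace colors and URLs), so this illustrates the rule rather than the library's actual implementation.

```swift
// Simplified stand-ins for MarkupStyleFont / MarkupStyle (hypothetical, not the library's types).
struct SketchFont: Equatable {
    var size: Double? = nil
    var isBold: Bool? = nil

    // Copy a value from `from` only where this font has none.
    mutating func fillIfNil(from: SketchFont?) {
        guard let from = from else { return }
        size = size ?? from.size
        isBold = isBold ?? from.isBold
    }
}

struct SketchStyle: Equatable {
    var font = SketchFont()
    var foregroundColor: String? = nil
    var link: String? = nil

    // The receiver's own values win; the parent only fills the gaps.
    mutating func fillIfNil(from: SketchStyle?) {
        guard let from = from else { return }
        font.fillIfNil(from: from.font)
        foregroundColor = foregroundColor ?? from.foregroundColor
        link = link ?? from.link
    }

    // Emit only the fields that are set, mirroring the idea behind render().
    func render() -> [String: String] {
        var data: [String: String] = [:]
        if let size = font.size { data["fontSize"] = String(size) }
        if let isBold = font.isBold { data["bold"] = String(isBold) }
        if let color = foregroundColor { data["foregroundColor"] = color }
        if let link = link { data["link"] = link }
        return data
    }
}

// <a><b>test</b></a>: the bold style inherits whatever it lacks from the link style.
let sketchLinkStyle = SketchStyle(font: SketchFont(size: 13), foregroundColor: "blue", link: "https://zhgchg.li")
var sketchBoldStyle = SketchStyle(font: SketchFont(isBold: true))
sketchBoldStyle.fillIfNil(from: sketchLinkStyle)
print(sketchBoldStyle.render())
```

Because every field is optional, "not set" and "set by a parent" stay distinguishable until render() is called, which is what keeps the inheritance rules under control.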
HTMLTagStyleAttribute & HTMLTagStyleAttributeVisitor
Let's briefly cover these two objects. HTML tags can carry CSS settings in their style attribute, so we apply the same abstraction used for HTMLTagName to HTML style attributes.
For instance, HTML might provide: <a style="color:red;font-size:14px">RedLink</a>, which means this link should be styled with red color and a font size of 14px.
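As a rough illustration of what has to happen to such a style value, the sketch below splits it into name/value pairs. parseInlineStyle is a hypothetical helper written for this article, not a ZMarkupParser API, and it only handles the simple "name:value;name:value" form.

```swift
import Foundation

// Hypothetical helper (not part of ZMarkupParser): splits an inline style string
// such as "color:red;font-size:14px" into (name, value) pairs.
func parseInlineStyle(_ styleString: String) -> [(name: String, value: String)] {
    var result: [(name: String, value: String)] = []
    for declaration in styleString.split(separator: ";") {
        // maxSplits: 1 keeps any later ":" inside the value.
        let parts = declaration.split(separator: ":", maxSplits: 1)
        guard parts.count == 2 else { continue }
        let name = parts[0].trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
        let value = parts[1].trimmingCharacters(in: .whitespacesAndNewlines)
        guard !name.isEmpty, !value.isEmpty else { continue }
        result.append((name: name, value: value))
    }
    return result
}

for pair in parseInlineStyle("color:red; font-size:14px;") {
    print("\(pair.name) = \(pair.value)")
}
```

Malformed declarations (a missing value, no colon) are skipped instead of failing, which seems the safer default when the HTML comes from an API.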
public protocol HTMLTagStyleAttribute {
var styleName: String { get }
func accept<V: HTMLTagStyleAttributeVisitor>(_ visitor: V) -> V.Result
}
public protocol HTMLTagStyleAttributeVisitor {
associatedtype Result
func visit(styleAttribute: HTMLTagStyleAttribute) -> Result
func visit(_ styleAttribute: ColorHTMLTagStyleAttribute) -> Result
func visit(_ styleAttribute: FontSizeHTMLTagStyleAttribute) -> Result
//...
}
public extension HTMLTagStyleAttributeVisitor {
func visit(styleAttribute: HTMLTagStyleAttribute) -> Result {
return styleAttribute.accept(self)
}
}
Corresponding implementation of HTMLTagStyleAttribute in the source code.
HTMLTagStyleAttributeToMarkupStyleVisitor
struct HTMLTagStyleAttributeToMarkupStyleVisitor: HTMLTagStyleAttributeVisitor {
typealias Result = MarkupStyle?
let value: String
func visit(_ styleAttribute: ColorHTMLTagStyleAttribute) -> Result {
// Extract Color Hex or Mapping from HTML Pre-defined Color Name using regex, please refer to the source code.
guard let color = MarkupStyleColor(string: value) else { return nil }
return MarkupStyle(foregroundColor: color)
}
func visit(_ styleAttribute: FontSizeHTMLTagStyleAttribute) -> Result {
// Extract 10px -> 10 using regex, please refer to the source code.
guard let size = self.convert(fromPX: value) else { return nil }
return MarkupStyle(font: MarkupStyleFont(size: CGFloat(size)))
}
// ...
}
Corresponding implementation of HTMLTagAttributeToMarkupStyleVisitor.swift in the source code.
The value passed to init is the value of the style attribute, and it is converted into the corresponding MarkupStyle field according to the attribute type being visited.
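The double dispatch can be tried out in isolation with the self-contained sketch below. All Sketch* names are hypothetical miniatures of the protocols above; the color lookup is a two-entry table and the px conversion uses plain suffix stripping rather than the regex used in the source.

```swift
// Hypothetical miniature of the style-attribute visitor (names are illustrative).
protocol SketchStyleAttribute {
    var styleName: String { get }
    func accept<V: SketchStyleAttributeVisitor>(_ visitor: V) -> V.Result
}

protocol SketchStyleAttributeVisitor {
    associatedtype Result
    func visit(_ attribute: SketchColorAttribute) -> Result
    func visit(_ attribute: SketchFontSizeAttribute) -> Result
}

struct SketchColorAttribute: SketchStyleAttribute {
    let styleName = "color"
    func accept<V: SketchStyleAttributeVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

struct SketchFontSizeAttribute: SketchStyleAttribute {
    let styleName = "font-size"
    func accept<V: SketchStyleAttributeVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

// The converted result: only the field that the attribute controls is set.
struct SketchAttributeStyle: Equatable {
    var foregroundColor: String? = nil
    var fontSize: Double? = nil
}

struct SketchValueVisitor: SketchStyleAttributeVisitor {
    typealias Result = SketchAttributeStyle?
    let value: String // raw attribute value, e.g. "red" or "14px"

    func visit(_ attribute: SketchColorAttribute) -> Result {
        // Tiny lookup table standing in for hex parsing and the predefined color names.
        let known = ["red": "#FF0000", "blue": "#0000FF"]
        guard let hex = known[value.lowercased()] else { return nil }
        return SketchAttributeStyle(foregroundColor: hex)
    }

    func visit(_ attribute: SketchFontSizeAttribute) -> Result {
        // Plain suffix stripping instead of the regex used in the source.
        guard value.hasSuffix("px"), let size = Double(value.dropLast(2)) else { return nil }
        return SketchAttributeStyle(fontSize: size)
    }
}

let sketchFontSize = SketchFontSizeAttribute().accept(SketchValueVisitor(value: "14px"))
print(sketchFontSize?.fontSize ?? 0)
```

Returning nil for values that cannot be converted lets the caller simply ignore an unsupported declaration.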
HTMLElementMarkupComponentMarkupStyleVisitor
After introducing the MarkupStyle object, we need to convert the HTMLElementComponents produced by Normalization into MarkupStyle.
// MarkupStyle policy
public enum MarkupStylePolicy {
case respectMarkupStyleFromCode // Take the style from Code as the main one and use HTML Style Attribute to fill in the gaps
case respectMarkupStyleFromHTMLStyleAttribute // Take the style from HTML Style Attribute as the main one and use Code to fill in the gaps
}
struct HTMLElementMarkupComponentMarkupStyleVisitor: MarkupVisitor {
typealias Result = MarkupStyle?
let policy: MarkupStylePolicy
let components: [HTMLElementMarkupComponent]
let styleAttributes: [HTMLTagStyleAttribute]
func visit(_ markup: BoldMarkup) -> Result {
// The `.bold` is just a default style defined in MarkupStyle. Please refer to the Source Code.
return defaultVisit(components.value(markup: markup), defaultStyle: .bold)
}
func visit(_ markup: LinkMarkup) -> Result {
// The `.link` is just a default style defined in MarkupStyle. Please refer to the Source Code.
var markupStyle = defaultVisit(components.value(markup: markup), defaultStyle: .link) ?? .link
// Get the corresponding HTMLElement for LinkMarkup from HTMLElementComponents
// Find the href parameter in the attributes of HtmlElement (in the form of an HTML URL string)
if let href = components.value(markup: markup)?.attributes?["href"] as? String,
let url = URL(string: href) {
markupStyle.link = url
}
return markupStyle
}
// ...
}
extension HTMLElementMarkupComponentMarkupStyleVisitor {
// Get the specified customized MarkupStyle from the HTMLTag container
private func customStyle(_ htmlElement: HTMLElementMarkupComponent.HTMLElement?) -> MarkupStyle? {
guard let customStyle = htmlElement?.tag.customStyle else {
return nil
}
return customStyle
}
// Default action
func defaultVisit(_ htmlElement: HTMLElementMarkupComponent.HTMLElement?, defaultStyle: MarkupStyle? = nil) -> Result {
var markupStyle: MarkupStyle? = customStyle(htmlElement) ?? defaultStyle
// Get the LinkMarkup corresponding to HtmlElementComponents
// Check if the HtmlElement has the `Style` Attribute
guard let styleString = htmlElement?.attributes?["style"],
styleAttributes.count > 0 else {
// If not, return the markupStyle as is
return markupStyle
}
// If there are Style Attributes
// Split the Style Value string into an array
// e.g. font-size:14px;color:red -> ["font-size":"14px","color":"red"]
let styles = styleString.split(separator: ";").filter { $0.trimmingCharacters(in: .whitespacesAndNewlines) != "" }.map { $0.split(separator: ":") }
for style in styles {
guard style.count == 2 else {
continue
}
// e.g. font-size
let key = style[0].trimmingCharacters(in: .whitespacesAndNewlines)
// e.g. 14px
let value = style[1].trimmingCharacters(in: .whitespacesAndNewlines)
if let styleAttribute = styleAttributes.first(where: { $0.isEqualTo(styleName: key) }) {
// Use the HTMLTagStyleAttributeToMarkupStyleVisitor from the previous context to convert to MarkupStyle
let visitor = HTMLTagStyleAttributeToMarkupStyleVisitor(value: value)
if var thisMarkupStyle = visitor.visit(styleAttribute: styleAttribute) {
// When the Style Attribute has a converted value..
// Merge the previous MarkupStyle result with this one
thisMarkupStyle.fillIfNil(from: markupStyle)
markupStyle = thisMarkupStyle
}
}
}
// If there is a default Style
if var defaultStyle = defaultStyle {
switch policy {
case .respectMarkupStyleFromHTMLStyleAttribute:
// Take the Style Attribute MarkupStyle as the main one and then merge with the defaultStyle result
markupStyle?.fillIfNil(from: defaultStyle)
case .respectMarkupStyleFromCode:
// Take the defaultStyle as the main one and then merge with the Style Attribute MarkupStyle result
defaultStyle.fillIfNil(from: markupStyle)
markupStyle = defaultStyle
}
}
return markupStyle
}
}
The implementation corresponds to the original code in HTMLTagAttributeToMarkupStyleVisitor.swift.
We will define some default styles in MarkupStyle. In some cases, if certain Markup elements do not have the desired styles specified externally, they will use the default styles.
There are two style inheritance strategies:
- respectMarkupStyleFromCode: The default styles take precedence; then, check the Style Attributes to see if any additional styles can be applied, but ignore them if they already have a value.
- respectMarkupStyleFromHTMLStyleAttribute: The Style Attributes take precedence; then, check the default styles to see if any additional styles can be applied, but ignore them if they already have a value.
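The difference between the two strategies comes down to which side is the receiver of fillIfNil. The sketch below shows only that merge order; PolicyStyle, SketchPolicy and merge are hypothetical simplifications, not the library's types.

```swift
// Simplified, hypothetical stand-ins; only the merge order is the point here.
struct PolicyStyle: Equatable {
    var color: String? = nil
    var fontSize: Double? = nil

    mutating func fillIfNil(from: PolicyStyle?) {
        guard let from = from else { return }
        color = color ?? from.color
        fontSize = fontSize ?? from.fontSize
    }
}

enum SketchPolicy {
    case respectMarkupStyleFromCode
    case respectMarkupStyleFromHTMLStyleAttribute
}

// `codeStyle` is the style defined in code, `attributeStyle` comes from the HTML style attribute.
func merge(codeStyle: PolicyStyle, attributeStyle: PolicyStyle, policy: SketchPolicy) -> PolicyStyle {
    switch policy {
    case .respectMarkupStyleFromCode:
        var result = codeStyle
        result.fillIfNil(from: attributeStyle) // attributes only fill the gaps
        return result
    case .respectMarkupStyleFromHTMLStyleAttribute:
        var result = attributeStyle
        result.fillIfNil(from: codeStyle)      // code only fills the gaps
        return result
    }
}

let demoCodeStyle = PolicyStyle(color: "blue")
let demoAttributeStyle = PolicyStyle(color: "red", fontSize: 14)
print(merge(codeStyle: demoCodeStyle, attributeStyle: demoAttributeStyle, policy: .respectMarkupStyleFromCode))
print(merge(codeStyle: demoCodeStyle, attributeStyle: demoAttributeStyle, policy: .respectMarkupStyleFromHTMLStyleAttribute))
```

In both cases the font size from the style attribute survives, because the code style never set one; only the conflicting color depends on the policy.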
HTMLElementWithMarkupToMarkupStyleProcessor
This processor converts the normalization result into an AST & MarkupStyleComponent.
Declare a new MarkupComponent to hold the corresponding MarkupStyle:
struct MarkupStyleComponent: MarkupComponent {
typealias T = MarkupStyle
let markup: Markup
let value: MarkupStyle
init(markup: Markup, value: MarkupStyle) {
self.markup = markup
self.value = value
}
}
Simple traversal of the Markup Tree & HTMLElementMarkupComponent structure:
let styleAttributes: [HTMLTagStyleAttribute]
let policy: MarkupStylePolicy
func process(from: (Markup, [HTMLElementMarkupComponent])) -> [MarkupStyleComponent] {
var components: [MarkupStyleComponent] = []
let visitor = HTMLElementMarkupComponentMarkupStyleVisitor(policy: policy, components: from.1, styleAttributes: styleAttributes)
walk(markup: from.0, visitor: visitor, components: &components)
return components
}
func walk(markup: Markup, visitor: HTMLElementMarkupComponentMarkupStyleVisitor, components: inout [MarkupStyleComponent]) {
if let markupStyle = visitor.visit(markup: markup) {
components.append(.init(markup: markup, value: markupStyle))
}
for markup in markup.childMarkups {
walk(markup: markup, visitor: visitor, components: &components)
}
}
// print(components)
// [(markup: LinkMarkup, MarkupStyle(link: https://zhgchg.li, color: .blue))]
// [(markup: BoldMarkup, MarkupStyle(font: .init(weight: .bold)))]
Corresponding implementation in the source code can be found in HTMLElementWithMarkupToMarkupStyleProcessor.swift.
Flow result as shown in the above image
Render — Convert To NSAttributedString
Now that we have the abstract HTML Tag tree structure and corresponding MarkupStyle, we can proceed with the final step of generating the NSAttributedString rendering result.
MarkupNSAttributedStringVisitor
This is the implementation of the MarkupVisitor protocol to convert markup into NSAttributedString.
struct MarkupNSAttributedStringVisitor: MarkupVisitor {
typealias Result = NSAttributedString
let components: [MarkupStyleComponent]
// MarkupStyle for root/base, externally specified, for example, to set the overall font size.
let rootStyle: MarkupStyle?
func visit(_ markup: RootMarkup) -> Result {
// Traverse to the RawString object.
return collectAttributedString(markup)
}
func visit(_ markup: RawStringMarkup) -> Result {
// Return the Raw String.
// Collect all MarkupStyles in the chain.
// Apply the Style to NSAttributedString.
return applyMarkupStyle(markup.attributedString, with: collectMarkupStyle(markup))
}
func visit(_ markup: BoldMarkup) -> Result {
// Traverse to the RawString object.
return collectAttributedString(markup)
}
func visit(_ markup: LinkMarkup) -> Result {
// Traverse to the RawString object.
return collectAttributedString(markup)
}
// ...
}
private extension MarkupNSAttributedStringVisitor {
// Apply the Style to NSAttributedString.
func applyMarkupStyle(_ attributedString: NSAttributedString, with markupStyle: MarkupStyle?) -> NSAttributedString {
guard let markupStyle = markupStyle else { return attributedString }
let mutableAttributedString = NSMutableAttributedString(attributedString: attributedString)
mutableAttributedString.addAttributes(markupStyle.render(), range: NSMakeRange(0, mutableAttributedString.string.utf16.count))
return mutableAttributedString
}
func collectAttributedString(_ markup: Markup) -> NSMutableAttributedString {
// Collect from downstream.
// Root -> Bold -> String("Bold")
// \
// > String("Test")
// Result: Bold Test
// Traverse down the tree to find raw strings, recursively visit and combine them into the final NSAttributedString.
return markup.childMarkups.compactMap({ visit(markup: $0) }).reduce(NSMutableAttributedString()) { partialResult, attributedString in
partialResult.append(attributedString)
return partialResult
}
}
func collectMarkupStyle(_ markup: Markup) -> MarkupStyle? {
// Collect from upstream.
// String("Test") -> Bold -> Italic -> Root
// Result: style: Bold+Italic
// Traverse up the tree to find parent tag's markup style.
// Then inherit styles layer by layer.
var currentMarkup: Markup? = markup.parentMarkup
var currentStyle = components.value(markup: markup)
while let thisMarkup = currentMarkup {
guard let thisMarkupStyle = components.value(markup: thisMarkup) else {
currentMarkup = thisMarkup.parentMarkup
continue
}
if var thisCurrentStyle = currentStyle {
thisCurrentStyle.fillIfNil(from: thisMarkupStyle)
currentStyle = thisCurrentStyle
} else {
currentStyle = thisMarkupStyle
}
currentMarkup = thisMarkup.parentMarkup
}
if var currentStyle = currentStyle {
currentStyle.fillIfNil(from: rootStyle)
return currentStyle
} else {
return rootStyle
}
}
}
This corresponds to the MarkupNSAttributedStringVisitor.swift in the source code.
The workflow and result are depicted in the above image.
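The two directions of traversal (down to collect strings, up to collect styles) can be sketched without UIKit. SketchNode, collectStyle and renderRuns are hypothetical; styles are plain dictionaries here, and the result is a list of text runs instead of an NSAttributedString.

```swift
// Hypothetical miniature of the markup tree (not the library's Markup types).
final class SketchNode {
    let text: String?                    // set only on raw-string leaves
    let style: [String: String]          // this node's own style, if any
    weak var parent: SketchNode?
    private(set) var children: [SketchNode] = []

    init(text: String? = nil, style: [String: String] = [:]) {
        self.text = text
        self.style = style
    }

    @discardableResult
    func add(_ child: SketchNode) -> SketchNode {
        child.parent = self
        children.append(child)
        return self
    }
}

// Walk up: the nearest value wins, ancestors and finally rootStyle fill the gaps.
func collectStyle(for node: SketchNode, rootStyle: [String: String]) -> [String: String] {
    var style = node.style
    var current = node.parent
    while let ancestor = current {
        style.merge(ancestor.style) { own, _ in own }
        current = ancestor.parent
    }
    style.merge(rootStyle) { own, _ in own }
    return style
}

// Walk down: gather every raw string together with its collected style.
func renderRuns(_ node: SketchNode, rootStyle: [String: String]) -> [(text: String, style: [String: String])] {
    if let text = node.text {
        return [(text: text, style: collectStyle(for: node, rootStyle: rootStyle))]
    }
    return node.children.flatMap { renderRuns($0, rootStyle: rootStyle) }
}

// <a href="https://zhgchg.li"><b>Li</b>nk</a>
let sketchTreeRoot = SketchNode()
let sketchTreeLink = SketchNode(style: ["link": "https://zhgchg.li", "color": "blue"])
let sketchTreeBold = SketchNode(style: ["weight": "bold"])
sketchTreeBold.add(SketchNode(text: "Li"))
sketchTreeLink.add(sketchTreeBold).add(SketchNode(text: "nk"))
sketchTreeRoot.add(sketchTreeLink)

for run in renderRuns(sketchTreeRoot, rootStyle: ["size": "13"]) {
    print(run.text, run.style.keys.sorted())
}
```

The parent reference is weak, as in the article's Markup classes, so the tree is owned from the root downward and no retain cycle is created.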
Finally, we arrive at the following:
Link{
NSColor = "Blue";
NSFont = "<UICTFont: 0x145d17600> font-family: \".SFUI-Regular\"; font-weight: normal; font-style: normal; font-size: 13.00pt";
NSLink = "https://zhgchg.li";
}nk{
NSColor = "Blue";
NSFont = "<UICTFont: 0x145d18710> font-family: \".SFUI-Semibold\"; font-weight: bold; font-style: normal; font-size: 13.00pt";
NSLink = "https://zhgchg.li";
}Bold{
NSFont = "<UICTFont: 0x145d18710> font-family: \".SFUI-Semibold\"; font-weight: bold; font-style: normal; font-size: 13.00pt";
}
🎉🎉🎉🎉 It’s done! 🎉🎉🎉🎉
We have now completed the entire conversion process from HTML String to NSAttributedString.
Stripper — Removing HTML Tags
Stripping HTML tags is relatively simple, requiring only the following code snippet:
func attributedString(_ markup: Markup) -> NSAttributedString {
if let rawStringMarkup = markup as? RawStringMarkup {
return rawStringMarkup.attributedString
} else {
return markup.childMarkups.compactMap({ attributedString($0) }).reduce(NSMutableAttributedString()) { partialResult, attributedString in
partialResult.append(attributedString)
return partialResult
}
}
}
The corresponding implementation can be found in the MarkupStripperProcessor.swift file.
It functions similarly to Render, but specifically returns the content when RawStringMarkup is encountered.
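The same recursion can be shown on a toy tree with plain strings. StripNode and strip are hypothetical names used only for this sketch.

```swift
// Hypothetical minimal markup tree: a node is either raw text or a tag with children.
enum StripNode {
    case rawString(String)
    case tag(name: String, children: [StripNode])
}

// Tags contribute nothing themselves; only the raw strings below them are kept.
func strip(_ node: StripNode) -> String {
    switch node {
    case .rawString(let text):
        return text
    case .tag(_, let children):
        return children.map(strip).joined()
    }
}

// <b>Test<a>Link</a></b>
let stripExample = StripNode.tag(name: "b", children: [
    .rawString("Test"),
    .tag(name: "a", children: [.rawString("Link")])
])
print(strip(stripExample))
```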
Extend — Dynamic Extension
To extend coverage for all HTML Tags/Style Attributes, a dynamic extension approach was adopted, making it convenient to dynamically expand objects directly from the code:
public struct ExtendTagName: HTMLTagName {
public let string: String
public init(_ w3cHTMLTagName: WC3HTMLTagName) {
self.string = w3cHTMLTagName.rawValue
}
public init(_ string: String) {
self.string = string.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
}
public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
return visitor.visit(self)
}
}
// to
final class ExtendMarkup: Markup {
weak var parentMarkup: Markup? = nil
var childMarkups: [Markup] = []
func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
return visitor.visit(self)
}
}
//----
public struct ExtendHTMLTagStyleAttribute: HTMLTagStyleAttribute {
public let styleName: String
public let render: ((String) -> (MarkupStyle?)) // Dynamic use of closure to change MarkupStyle
public init(styleName: String, render: @escaping ((String) -> (MarkupStyle?))) {
self.styleName = styleName
self.render = render
}
public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagStyleAttributeVisitor {
return visitor.visit(self)
}
}
ZHTMLParserBuilder
Finally, we employ the Builder Pattern to allow external modules to swiftly construct the necessary objects for ZMarkupParser and handle Access Level Control.
public final class ZHTMLParserBuilder {
private(set) var htmlTags: [HTMLTag] = []
private(set) var styleAttributes: [HTMLTagStyleAttribute] = []
private(set) var rootStyle: MarkupStyle?
private(set) var policy: MarkupStylePolicy = .respectMarkupStyleFromCode
public init() {
}
public static func initWithDefault() -> Self {
var builder = Self.init()
for htmlTagName in ZHTMLParserBuilder.htmlTagNames {
builder = builder.add(htmlTagName)
}
for styleAttribute in ZHTMLParserBuilder.styleAttributes {
builder = builder.add(styleAttribute)
}
return builder
}
public func set(_ htmlTagName: HTMLTagName, withCustomStyle markupStyle: MarkupStyle?) -> Self {
return self.add(htmlTagName, withCustomStyle: markupStyle)
}
public func add(_ htmlTagName: HTMLTagName, withCustomStyle markupStyle: MarkupStyle? = nil) -> Self {
// Only one instance of the same tagName can exist
htmlTags.removeAll { htmlTag in
return htmlTag.tagName.string == htmlTagName.string
}
htmlTags.append(HTMLTag(tagName: htmlTagName, customStyle: markupStyle))
return self
}
public func add(_ styleAttribute: HTMLTagStyleAttribute) -> Self {
styleAttributes.removeAll { thisStyleAttribute in
return thisStyleAttribute.styleName == styleAttribute.styleName
}
styleAttributes.append(styleAttribute)
return self
}
public func set(rootStyle: MarkupStyle) -> Self {
self.rootStyle = rootStyle
return self
}
public func set(policy: MarkupStylePolicy) -> Self {
self.policy = policy
return self
}
public func build() -> ZHTMLParser {
// ZHTMLParser init is only accessible internally, and external entities cannot initialize it directly.
// It can only be initialized through ZHTMLParserBuilder init.
return ZHTMLParser(htmlTags: htmlTags, styleAttributes: styleAttributes, policy: policy, rootStyle: rootStyle)
}
}
Corresponding implementation for ZHTMLParserBuilder.swift in the source code.
The ‘initWithDefault’ function is set to include all implemented HTML tag names and style attributes by default.
public extension ZHTMLParserBuilder {
static var htmlTagNames: [HTMLTagName] {
return [
A_HTMLTagName(),
B_HTMLTagName(),
BR_HTMLTagName(),
DIV_HTMLTagName(),
HR_HTMLTagName(),
I_HTMLTagName(),
LI_HTMLTagName(),
OL_HTMLTagName(),
P_HTMLTagName(),
SPAN_HTMLTagName(),
STRONG_HTMLTagName(),
U_HTMLTagName(),
UL_HTMLTagName(),
DEL_HTMLTagName(),
TR_HTMLTagName(),
TD_HTMLTagName(),
TH_HTMLTagName(),
TABLE_HTMLTagName(),
IMG_HTMLTagName(handler: nil),
// ...
]
}
}
public extension ZHTMLParserBuilder {
static var styleAttributes: [HTMLTagStyleAttribute] {
return [
ColorHTMLTagStyleAttribute(),
BackgroundColorHTMLTagStyleAttribute(),
FontSizeHTMLTagStyleAttribute(),
FontWeightHTMLTagStyleAttribute(),
LineHeightHTMLTagStyleAttribute(),
WordSpacingHTMLTagStyleAttribute(),
// ...
]
}
}
The initializer of ZHTMLParser is internal, so it cannot be called directly from outside the module; a parser can only be created through ZHTMLParserBuilder.
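A stripped-down sketch of the same builder idea follows. SketchParser and SketchParserBuilder are hypothetical; tag names are plain strings, and fileprivate stands in for the internal initializer, since a single-file example has no module boundary to demonstrate.

```swift
// Hypothetical miniature of the builder; SketchParser stands in for ZHTMLParser.
struct SketchParser {
    let tagNames: [String]
    let rootFontSize: Double?

    // fileprivate plays the role of the internal initializer in the article:
    // code outside this file has to go through SketchParserBuilder.
    fileprivate init(tagNames: [String], rootFontSize: Double?) {
        self.tagNames = tagNames
        self.rootFontSize = rootFontSize
    }
}

final class SketchParserBuilder {
    private(set) var tagNames: [String] = []
    private(set) var rootFontSize: Double?

    static func initWithDefault() -> SketchParserBuilder {
        let builder = SketchParserBuilder()
        for name in ["a", "b", "br"] {
            builder.add(name)
        }
        return builder
    }

    @discardableResult
    func add(_ tagName: String) -> SketchParserBuilder {
        // Only one entry per tag name: a later add replaces the earlier one.
        let normalized = tagName.lowercased()
        tagNames.removeAll { $0 == normalized }
        tagNames.append(normalized)
        return self
    }

    @discardableResult
    func set(rootFontSize: Double) -> SketchParserBuilder {
        self.rootFontSize = rootFontSize
        return self
    }

    func build() -> SketchParser {
        return SketchParser(tagNames: tagNames, rootFontSize: rootFontSize)
    }
}

let demoParser = SketchParserBuilder.initWithDefault().add("B").add("i").set(rootFontSize: 13).build()
print(demoParser.tagNames)
```

Returning the builder from every method is what allows the chained call style, and the dedupe step means a custom style for an existing tag overrides the default instead of registering the tag twice.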
ZHTMLParser encapsulates the operations for rendering, selecting, and stripping:
public final class ZHTMLParser: ZMarkupParser {
let htmlTags: [HTMLTag]
let styleAttributes: [HTMLTagStyleAttribute]
let rootStyle: MarkupStyle?
internal init(...) {
}
// Retrieves link style attributes
public var linkTextAttributes: [NSAttributedString.Key: Any] {
// ...
}
public func selector(_ string: String) -> HTMLSelector {
// ...
}
public func selector(_ attributedString: NSAttributedString) -> HTMLSelector {
// ...
}
public func render(_ string: String) -> NSAttributedString {
// ...
}
// Allows rendering NSAttributedString within a node using the HTMLSelector result
public func render(_ selector: HTMLSelector) -> NSAttributedString {
// ...
}
public func render(_ attributedString: NSAttributedString) -> NSAttributedString {
// ...
}
public func stripper(_ string: String) -> String {
// ...
}
public func stripper(_ attributedString: NSAttributedString) -> NSAttributedString {
// ...
}
// ...
}
This corresponds to the implementation in the ZHTMLParser.swift source code.
UIKit Issue
When using NSAttributedString, the most common scenario is to display it in a UITextView. However, there are some considerations to be aware of:
- The link style inside a UITextView is determined uniformly by the linkTextAttributes property; it ignores the settings in NSAttributedString.Key, and individual links cannot be styled separately. This is why ZMarkupParser.linkTextAttributes is available.
- As for UILabel, there is currently no direct way to change the link style. Also, since UILabel does not have TextStorage, NSTextAttachment images need to be handled differently.
public extension UITextView {
func setHtmlString(_ string: String, with parser: ZHTMLParser) {
self.setHtmlString(NSAttributedString(string: string), with: parser)
}
func setHtmlString(_ string: NSAttributedString, with parser: ZHTMLParser) {
self.attributedText = parser.render(string)
self.linkTextAttributes = parser.linkTextAttributes
}
}
public extension UILabel {
func setHtmlString(_ string: String, with parser: ZHTMLParser) {
self.setHtmlString(NSAttributedString(string: string), with: parser)
}
func setHtmlString(_ string: NSAttributedString, with parser: ZHTMLParser) {
let attributedString = parser.render(string)
attributedString.enumerateAttribute(NSAttributedString.Key.attachment, in: NSMakeRange(0, attributedString.string.utf16.count), options: []) { (value, effectiveRange, _) in
guard let attachment = value as? ZNSTextAttachment else {
return
}
attachment.register(self)
}
self.attributedText = attributedString
}
}
With these extensions added to UIKit, external users can simply call setHtmlString() to accomplish the binding without worrying about these details.
Handling Complex Rendering - Item Lists
Here, we document the implementation of item lists.
Using <ol> / <ul> in HTML to represent item lists:
<ul>
<li>ItemA</li>
<li>ItemB</li>
<li>ItemC</li>
//...
</ul>
Using the same parsing method mentioned earlier, we can obtain the other list items in visit(_ markup: ListItemMarkup) and know the current list index (thanks to the conversion to AST).
func visit(_ markup: ListItemMarkup) -> Result {
let siblingListItems = markup.parentMarkup?.childMarkups.filter({ $0 is ListItemMarkup }) ?? []
let position = (siblingListItems.firstIndex(where: { $0 === markup }) ?? 0)
}
NSParagraphStyle has an NSTextList object that can be used to display list items, but the width of the blank space cannot be customized (personally, I find the default too wide). If there is a space between the item symbol and the string, the line may break there, resulting in a strange display, as shown below:
Better results may be possible by setting headIndent, firstLineHeadIndent, and NSTextTab, but in testing this still did not produce perfect results for longer strings or varying font sizes.
For now, we have reached an acceptable result by manually composing the item list strings and inserting them before the content.
We only utilize NSTextList.MarkerFormat to generate item list symbols, rather than directly using NSTextList.
Supported list item symbols can be found here: MarkupStyleList.swift
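The manual composition can be sketched as follows. SketchListStyle and composeList are hypothetical, and plain string formatting stands in for the symbols the article derives from NSTextList.MarkerFormat.

```swift
// Hypothetical marker styles; the article derives its symbols from NSTextList.MarkerFormat instead.
enum SketchListStyle {
    case decimal // 1. 2. 3.
    case disc    // bullet
}

// Prefix every item with its marker, using the item's position among its siblings.
func composeList(_ items: [String], style: SketchListStyle, startingAt start: Int = 1) -> String {
    let lines = items.enumerated().map { (position, item) -> String in
        let marker: String
        switch style {
        case .decimal:
            marker = "\(start + position)."
        case .disc:
            marker = "\u{2022}"
        }
        return "\(marker) \(item)"
    }
    return lines.joined(separator: "\n")
}

print(composeList(["ItemA", "ItemB", "ItemC"], style: .decimal))
print(composeList(["ItemA", "ItemB", "ItemC"], style: .disc))
```

Since the marker is just text in front of the content, its spacing is fully under our control, which is the point of not using NSTextList directly.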
The final display result: ( <ol><li> )
Handling Complex Rendering - Tables
Similar to item lists, but this time for tables.
Using <table> in HTML to represent a table, <tr> for table rows, and <td>/<th> for table cells:
<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
Testing with the native NSAttributedString.DocumentType.html has shown that it relies on the Private macOS API NSTextBlock to render HTML tables, which lets it display the styles and contents accurately.
However, relying on Private API is not recommended, so we cannot use it 🥲
func visit(_ markup: TableColumnMarkup) -> Result {
let attributedString = collectAttributedString(markup)
let siblingColumns = markup.parentMarkup?.childMarkups.filter({ $0 is TableColumnMarkup }) ?? []
let position = (siblingColumns.firstIndex(where: { $0 === markup }) ?? 0)
// Check if a desired width is specified externally, if not, set .max to prevent string truncation
var maxLength: Int? = markup.fixedMaxLength
if maxLength == nil {
// If not specified, find the length of the first line in the same column as the maximum length
if let tableRowMarkup = markup.parentMarkup as? TableRowMarkup,
let firstTableRow = tableRowMarkup.parentMarkup?.childMarkups.first(where: { $0 is TableRowMarkup }) as? TableRowMarkup {
let firstTableRowColumns = firstTableRow.childMarkups.filter({ $0 is TableColumnMarkup })
if firstTableRowColumns.indices.contains(position) {
let firstTableRowColumnAttributedString = collectAttributedString(firstTableRowColumns[position])
let length = firstTableRowColumnAttributedString.string.utf16.count
maxLength = length
}
}
}
if let maxLength = maxLength {
// Truncate the field if it exceeds maxLength
if attributedString.string.utf16.count > maxLength {
attributedString.mutableString.setString(String(attributedString.string.prefix(maxLength)) + "...")
} else {
attributedString.mutableString.setString(attributedString.string.padding(toLength: maxLength, withPad: " ", startingAt: 0))
}
}
if position < siblingColumns.count - 1 {
// Add whitespace as spacing, external spacing width can be specified in number of whitespace characters
attributedString.append(makeString(in: markup, string: String(repeating: " ", count: markup.spacing)))
}
return attributedString
}
func visit(_ markup: TableRowMarkup) -> Result {
let attributedString = collectAttributedString(markup)
attributedString.append(makeBreakLine(in: markup)) // Add line break, please refer to Source Code for details
return attributedString
}
func visit(_ markup: TableMarkup) -> Result {
let attributedString = collectAttributedString(markup)
attributedString.append(makeBreakLine(in: markup)) // Add line break, please refer to Source Code for details
attributedString.insert(makeBreakLine(in: markup), at: 0) // Add line break, please refer to Source Code for details
return attributedString
}
Final rendering effect as shown in the figure below:
The implementation is not perfect, but it is acceptable.
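The column logic boils down to fitting every cell to the width of the first row. The sketch below does this on plain strings; fitCell and layoutRows are hypothetical helpers, widths are counted in characters, and padding is done with manual string repetition rather than the padding(toLength:withPad:startingAt:) call used above.

```swift
// Hypothetical helper mirroring the column logic: cut cells that are too long, pad short ones.
func fitCell(_ text: String, toLength maxLength: Int) -> String {
    if text.count > maxLength {
        return String(text.prefix(maxLength)) + "..."
    }
    return text + String(repeating: " ", count: maxLength - text.count)
}

// Lay out rows as text; each column's width is taken from the first row.
func layoutRows(_ rows: [[String]], spacing: Int = 2) -> String {
    guard let header = rows.first else { return "" }
    let widths = header.map { $0.count }
    let gap = String(repeating: " ", count: spacing)
    let lines = rows.map { (row: [String]) -> String in
        let cells = row.enumerated().map { (index, cell) -> String in
            return index < widths.count ? fitCell(cell, toLength: widths[index]) : cell
        }
        return cells.joined(separator: gap)
    }
    return lines.joined(separator: "\n")
}

print(layoutRows([
    ["Company", "Contact", "Country"],
    ["Alfreds Futterkiste", "Maria Anders", "Germany"]
]))
```

Note that a truncated cell ends up three characters wider than its column because of the appended "...", and that aligning by character count only lines up with a monospaced font; both are reasons this approach is acceptable rather than perfect.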
Handling Complex Rendering - Images
Now, let's talk about the ultimate challenge: loading remote images into NSAttributedString.
In HTML, use <img> to represent an image:
<img src="https://user-images.githubusercontent.com/33706588/219608966-20e0c017-d05c-433a-9a52-091bc0cfd403.jpg" width="300" height="125"/>
You can specify the desired display size using the width / height HTML attributes.
Displaying images in NSAttributedString is much more complicated than expected; there is no perfect solution yet. I encountered some difficulties while working on text wrapping around images in UITextView, and this time, I have researched it extensively but still haven’t found a perfect solution.
For now, let’s ignore the issue of NSTextAttachment not being reusable and not releasing memory. We’ll focus on implementing a solution where we download the image from a remote source, place it in an NSTextAttachment, and then add it to NSAttributedString, with automatic content updates.
I have separated this functionality into a smaller project, so it can be optimized and reused in other projects:
The main idea is inspired by the series of articles Asynchronous NSTextAttachments. However, I replaced the final content update part (to display the downloaded image properly), and I added a Delegate/DataSource for external extensions.
- Declare the ZNSTextAttachmentable object, encapsulating the NSTextStorage object (built into UITextView) and UILabel itself (UILabel does not have NSTextStorage).
  - The replace(attachment: ZNSTextAttachment, to: ZResizableNSTextAttachment) function is used to implement replacing the attributedString within a specific NSRange.
- The process involves wrapping the imageURL, PlaceholderImage, and desired display size in a ZNSTextAttachment, and initially displaying the image using a placeholder.
- When the system needs to display the image on the screen, it will call the image(forBounds… method, and we start downloading the image data.
- The DataSource is used externally to decide how to download or implement the Image Cache Policy. By default, URLSession is used to request the image data.
- Once the download is complete, a new ZResizableNSTextAttachment is created, and the logic to customize the image size is implemented in attachmentBounds(for….
- The replace(attachment: ZNSTextAttachment, to: ZResizableNSTextAttachment) method is called to replace the ZNSTextAttachment with the ZResizableNSTextAttachment.
- A didLoad Delegate notification is sent out, allowing external connections if needed.
- Completion.
For detailed code, please refer to the Source Code repository.
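The placeholder-then-replace flow above can be modeled without UIKit. Everything in this sketch is a hypothetical stand-in: the text storage is a list of segments, the data source is a synchronous closure instead of a URLSession download, and didLoadURLs takes the place of the didLoad delegate call.

```swift
// Hypothetical stand-ins: the text storage is modeled as a list of segments.
enum SketchSegment: Equatable {
    case text(String)
    case placeholder(url: String)
    case image(url: String, width: Int, height: Int)
}

final class SketchStorage {
    private(set) var segments: [SketchSegment]
    private(set) var didLoadURLs: [String] = [] // stands in for the didLoad delegate call

    init(segments: [SketchSegment]) {
        self.segments = segments
    }

    // Swap the placeholder for the loaded image at its known position.
    func replace(placeholderFor url: String, with image: SketchSegment) {
        guard let index = segments.firstIndex(of: .placeholder(url: url)) else { return }
        segments[index] = image
        didLoadURLs.append(url)
    }
}

// The data source decides how the image is obtained (network, cache, ...)
// and reports the loaded size through the completion closure.
typealias SketchImageDataSource = (_ url: String, _ completion: (_ width: Int, _ height: Int) -> Void) -> Void

func display(url: String, in storage: SketchStorage, dataSource: SketchImageDataSource) {
    dataSource(url) { width, height in
        storage.replace(placeholderFor: url, with: .image(url: url, width: width, height: height))
    }
}

let demoStorage = SketchStorage(segments: [.text("Hello "), .placeholder(url: "https://example.com/a.jpg")])
display(url: "https://example.com/a.jpg", in: demoStorage) { _, completion in
    completion(300, 125) // pretend the download finished with a 300x125 image
}
print(demoStorage.segments)
```

Keeping the download behind a closure is what lets callers plug in their own cache policy, while the storage only needs to know how to replace one known segment.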
We do not refresh the UI with NSLayoutManager.invalidateLayout(forCharacterRange: range, actualCharacterRange: nil) and NSLayoutManager.invalidateDisplay(forCharacterRange: range), because the UI did not update correctly that way. Since we already know the specific range, we directly replace that part of the NSAttributedString to ensure the UI updates accurately.
The final display result is as follows:
<span style="color:red">こんにちは</span>こんにちはこんにちは <br />
<img src="https://user-images.githubusercontent.com/33706588/219608966-20e0c017-d05c-433a-9a52-091bc0cfd403.jpg"/>
Testing & Continuous Integration
For this project, in addition to Unit Tests for individual components, Snapshot Tests were set up as integration tests that assess the final NSAttributedString as a whole.
The main functional logic is covered by Unit Tests; combined with the integration tests, the final Test Coverage is approximately 85%.
Snapshot Test
Directly import the framework and use:
    import SnapshotTesting
    // ...
    func testShouldKeepNSAttributedString() {
        let parser = ZHTMLParserBuilder.initWithDefault().build()
        let textView = UITextView()
        textView.frame.size.width = 390
        textView.isScrollEnabled = false
        textView.backgroundColor = .white
        textView.setHtmlString("html string...", with: parser)
        textView.layoutIfNeeded()
        assertSnapshot(matching: textView, as: .image, record: false)
    }
    // ...
![](/assets/2724f02f6e7/1*hLPeaOTOviA0jTPNOPu1hg.png)
Directly comparing the final result to the expected one ensures that the integration is functioning without any abnormalities.
Codecov Test Coverage
Codecov.io (free for public repos) is integrated to evaluate test coverage. Simply install the Codecov GitHub App and configure it.
After setting up the connection between Codecov and the GitHub repo, you can also add codecov.yml in the root directory of the project.
    comment: # this is a top-level key
      layout: "reach, diff, flags, files"
      behavior: default
      require_changes: false # if true: only post the comment if coverage changes
      require_base: no # [yes :: must have a base report to post]
      require_head: yes # [yes :: must have a head report to post]
With this configuration, every time a PR is created or reopened, the CI will automatically run, and the test result will be commented in the PR.
![](/assets/2724f02f6e7/1*AcKpF4dijglahV-iVYLvvA.png)
Continuous Integration
Github Action, CI integration: ci.yml
    name: CI
    on:
      workflow_dispatch:
      pull_request:
        types: [opened, reopened]
      push:
        branches:
          - main
    jobs:
      build:
        runs-on: self-hosted
        steps:
          - uses: actions/checkout@v3
          - name: spm build and test
            run: |
              set -o pipefail
              xcodebuild test -workspace ZMarkupParser.xcworkspace -testPlan ZMarkupParser -scheme ZMarkupParser -enableCodeCoverage YES -resultBundlePath './scripts/TestResult.xcresult' -destination 'platform=iOS Simulator,name=iPhone 14,OS=16.1' build test | xcpretty
          - name: Codecov
            uses: codecov/codecov-action@v3.1.1
            with:
              xcode: true
              xcode_archive_path: './scripts/TestResult.xcresult'
This configuration triggers the build and test on PR opened/reopened or push to the main branch. The test coverage report will be uploaded to Codecov.
Regex
When it comes to regular expressions, every time I use them, I improve my skills. In this project, I didn’t use them extensively, but I wanted to extract paired HTML tags using regex, so I researched how to write the expression for that purpose.
Here are some cheat sheet notes on what I learned this time:
- The ?: construct lets () group a sub-pattern without capturing it. e.g., (?:https?:\/\/)?(?:www\.)?example\.com returns the entire URL https://www.example.com as the match, without separate captures for https:// and www.
- The .+? construct performs a non-greedy match (it stops at the closest possible match). e.g., <.+?> returns <a> and </a> instead of the entire string <a>test</a>.
- The (?=XYZ) construct is a lookahead: it matches up to the point where the string XYZ appears, without consuming XYZ. Note that [^XYZ] looks similar but matches any single character that is not X, Y, or Z. e.g., (?:__)(.+?(?=__))(?:__) captures test from __test__.
- The ?R construct recursively applies the same rule. e.g., \((?:[^()]|((?R)))+\) will match (simple), (and(nested)), and (nested) in (simple) (and(nested)).
Swift (NSRegularExpression) currently does not support the recursive ?R construct; the other constructs above are supported.
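As a quick way to check which of these constructs Foundation's NSRegularExpression accepts, here is a minimal sketch. The helper regexMatches is a hypothetical name introduced only for this example; it returns nil when the pattern fails to compile. In this sketch the first three patterns compile and match, while the recursive pattern is expected to be rejected.

```swift
import Foundation

/// Returns, for every match, the full match followed by its capture groups.
/// Returns nil when the pattern cannot be compiled by NSRegularExpression.
func regexMatches(_ pattern: String, in text: String) -> [[String]]? {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: fullRange).map { match -> [String] in
        (0..<match.numberOfRanges).map { index -> String in
            // A group that did not participate in the match has no valid range.
            guard let range = Range(match.range(at: index), in: text) else { return "" }
            return String(text[range])
        }
    }
}

// Non-capturing groups: only the full match is returned, no captures.
let nonCapturingResult = regexMatches(#"(?:https?:\/\/)?(?:www\.)?example\.com"#,
                                      in: "https://www.example.com")
// Non-greedy match: each tag is matched separately.
let nonGreedyResult = regexMatches(#"<.+?>"#, in: "<a>test</a>")
// Lookahead: group 1 captures the text between the double underscores.
let lookaheadResult = regexMatches(#"(?:__)(.+?(?=__))(?:__)"#, in: "__test__")
// Recursion: expected to be nil because the pattern does not compile.
let recursiveResult = regexMatches(#"\((?:[^()]|((?R)))+\)"#,
                                   in: "(simple) (and(nested))")

print(nonCapturingResult ?? [])
print(nonGreedyResult ?? [])
print(lookaheadResult ?? [])
print(recursiveResult == nil ? "recursion not supported" : "recursion supported")
```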
Other Useful Regex Articles:
- Swift Regular Expression Cheat Sheet
- How Regular Expressions Work -> Useful for optimizing the regex performance of this project
- Case of Infinite Server Failure Caused by Regex Error
- Regex101 - Test and explore all regex rules
Swift Package Manager & Cocoapods
This was my first time developing with SPM and Cocoapods, and it was quite interesting. SPM is genuinely convenient. However, I ran into an issue when two projects depended on the same package: building both projects at the same time caused one of them to fail because the package could not be found.
I uploaded ZMarkupParser to Cocoapods, but I haven’t tested whether it works properly since I developed it with SPM 😝.
ChatGPT
Based on my experience using ChatGPT in development, I found it most useful for assisting in proofreading Readme files. Regarding development questions, I didn’t always get the most accurate answers, especially when asking mid-senior level questions. In those cases, ChatGPT couldn’t provide a definite or correct answer (I encountered this when asking about certain regex rules).
Moreover, I wouldn’t rely on ChatGPT to write complex code. While it can help with simple code generation for objects, it’s not capable of completing an entire tool architecture. (At least, that’s how it is currently. Copilot might be more helpful for writing code in the future)
However, ChatGPT can provide guidance on certain knowledge gaps, giving us a general direction on how to approach certain tasks. Sometimes, our understanding might be too limited to effectively search for the right solution on Google, and that’s when ChatGPT becomes helpful.
Declaration
After more than three months of research and development, I am quite exhausted. Nevertheless, I want to emphasize that this project represents the feasible results I obtained through my research. It may not be the optimal solution, and there may still be room for improvement. This project is more like a starting point, and I welcome contributions to achieve the perfect solution for Markup Language to NSAttributedString conversion. Your contributions are greatly appreciated, as many aspects need the power of the community to improve.
Contributing
At this moment (2023/03/12), I can think of several areas for improvement, and I will document them in the repository later:
- Performance and algorithm optimization: Although it is faster and more stable than the native NSAttributedString.DocumentType.html, there is still room for improvement. I believe the performance is not as good as XMLParser. I hope that one day, we can achieve the same performance while maintaining customization and automatic error correction.
- Support for more HTML tags and style attribute conversions.
- Further optimization of ZNSTextAttachment to implement reuse and memory release, which may require studying CoreText.
- Support for Markdown parsing: As the underlying abstraction is not limited to HTML, it should be possible to create a front-end conversion from Markdown to Markup objects. Therefore, I named it ZMarkupParser instead of ZHTMLParser, hoping that one day it can also support Markdown to NSAttributedString conversion.
- Support for Any to Any conversion, e.g., HTML to Markdown, Markdown to HTML. Since we have the original AST tree (Markup objects), it is feasible to implement conversion between any Markup formats.
- Implement CSS !important functionality and enhance the inheritance strategy of MarkupStyle.
- Strengthen HTML selector functionality, which currently only provides basic filtering.
- And many more improvements. Please feel free to open issues.
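To illustrate why an intermediate tree makes "Any to Any" conversion feasible, here is a deliberately tiny sketch. The Node enum and the two renderers are hypothetical and far simpler than ZMarkupParser's real Markup objects; the sketch also skips escaping entirely. It only shows the idea that, once the input is a tree, each output format is just another walk over it.

```swift
import Foundation

// Hypothetical, simplified markup tree. These are NOT ZMarkupParser's types.
indirect enum Node {
    case text(String)
    case bold([Node])
    case link(href: String, children: [Node])
}

// One renderer per output format; each one walks the same tree.
protocol NodeRenderer {
    func render(_ node: Node) -> String
}

extension NodeRenderer {
    /// Renders a list of sibling nodes and concatenates the results.
    func renderAll(_ nodes: [Node]) -> String {
        nodes.map { render($0) }.joined()
    }
}

struct HTMLRenderer: NodeRenderer {
    func render(_ node: Node) -> String {
        switch node {
        case .text(let string):
            return string
        case .bold(let children):
            return "<b>" + renderAll(children) + "</b>"
        case .link(let href, let children):
            return "<a href=\"\(href)\">" + renderAll(children) + "</a>"
        }
    }
}

struct MarkdownRenderer: NodeRenderer {
    func render(_ node: Node) -> String {
        switch node {
        case .text(let string):
            return string
        case .bold(let children):
            return "**" + renderAll(children) + "**"
        case .link(let href, let children):
            return "[" + renderAll(children) + "](" + href + ")"
        }
    }
}

// The same tree rendered into two different markup formats.
let tree = Node.bold([
    .text("Test"),
    .link(href: "https://example.com", children: [.text("Link")])
])

let html = HTMLRenderer().render(tree)
let markdown = MarkdownRenderer().render(tree)
print(html)     // <b>Test<a href="https://example.com">Link</a></b>
print(markdown) // **Test[Link](https://example.com)**
```

Supporting a new target format would then mean adding one more renderer, while a new source format would only need a front end that produces the same tree.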
Summary
These are all the technical details and thoughts behind my development of ZMarkupParser. It has taken me nearly three months of after-work and holiday time, countless research and experimentation, and finally, writing tests, increasing test coverage, and setting up CI to achieve a somewhat presentable result. I hope this tool can solve similar problems for others, and I hope we can all work together to make this tool even better.
Currently, it is being used in our company’s iOS app on pinkoi.com, and no issues have been found. 😄
Further Reading
For any questions or suggestions, please feel free to contact me.