
iOS Vision Framework x WWDC 24: Discover Swift Enhancements in the Vision Framework Session

A review of the Vision framework features & hands-on with the new Swift API in iOS 18

ℹ️ℹ️ℹ️ The following content is translated by OpenAI.

Click here to view the original Chinese version. | 點此查看本文中文版


Photo by [BoliviaInteligente](https://unsplash.com/@boliviainteligente?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash){:target="_blank"}


Topic

The relationship between Vision Pro and hot dogs is as unrelated as it gets.


Vision Framework

The Vision framework is Apple’s integrated, machine-learning-powered image recognition framework, which lets developers implement common image recognition features easily and quickly. Launched with iOS 11.0 in 2017 (the iPhone 8 era), it has been continuously iterated and optimized, including tighter integration with Swift Concurrency for better performance. Starting with iOS 18.0, it introduces a brand-new Swift Vision API that takes full advantage of Swift Concurrency.

Features of the Vision Framework

  • Built-in methods for various image recognition and dynamic tracking (31 methods available as of iOS 18)
  • On-device processing using the phone’s chip, ensuring fast and secure recognition without relying on cloud services
  • Simple and user-friendly API
  • Supported across all Apple platforms: iOS 11.0+, iPadOS 11.0+, Mac Catalyst 13.0+, macOS 10.13+, tvOS 11.0+, visionOS 1.0+
  • Released for several years (2017-present) with ongoing updates
  • Integrated Swift language features to enhance computational performance

I played around with this 6 years ago: An Introduction to Vision — Automatic Face Cropping for App Profile Pictures (Swift)

This time, I revisited it alongside the WWDC 24 Discover Swift enhancements in the Vision framework Session to explore the new Swift features again.

CoreML

Apple also has another framework called CoreML, an on-device machine learning framework that lets you train your own models (e.g., for recognizing objects or documents) and integrate them directly into your app. Interested developers can give it a try (e.g., Real-time Article Classification, Real-time Spam Detection …).
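
If you just want a feel for how a custom Core ML model plugs into Vision, here is a minimal sketch using the pre-iOS 18 API (the classic VNCoreMLRequest flow). `ArticleClassifier` is a hypothetical, Xcode-generated model class standing in for your own trained .mlmodel.

```swift
import UIKit
import Vision
import CoreML

// Minimal sketch: run a custom Core ML classifier through Vision (pre-iOS 18 API).
// "ArticleClassifier" is a hypothetical, Xcode-generated model class.
func classify(_ image: UIImage) throws {
    let coreMLModel = try ArticleClassifier(configuration: MLModelConfiguration()).model
    let visionModel = try VNCoreMLModel(for: coreMLModel)

    let request = VNCoreMLRequest(model: visionModel) { request, error in
        guard error == nil,
              let observations = request.results as? [VNClassificationObservation] else { return }
        // Print the top 3 labels with their confidence
        observations.prefix(3).forEach { print("\($0.identifier): \($0.confidence)") }
    }

    guard let cgImage = image.cgImage else { return }
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
}
```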

P.S.

Vision vs. VisionKit:

Vision is primarily used for image analysis tasks such as face recognition, barcode detection, and text recognition. It provides powerful APIs for processing and analyzing visual content in static images or videos.

VisionKit is specifically designed for tasks related to document scanning. It provides a scanner view controller that can be used to scan documents and generate high-quality PDFs or images.
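
For contrast, a minimal VisionKit sketch: presenting the built-in document scanner and reading back the scanned pages (the scanner UI lives in VisionKit, while the analysis requests discussed in this article live in Vision).

```swift
import UIKit
import VisionKit

// Minimal sketch: VisionKit's document scanner returning scanned pages as images.
final class ScanViewController: UIViewController, VNDocumentCameraViewControllerDelegate {
    func presentScanner() {
        guard VNDocumentCameraViewController.isSupported else { return }
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        present(scanner, animated: true)
    }

    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFinishWith scan: VNDocumentCameraScan) {
        // Each page is returned as a processed UIImage, ready to save as a PDF or feed into Vision.
        for pageIndex in 0..<scan.pageCount {
            let pageImage = scan.imageOfPage(at: pageIndex)
            print("Scanned page \(pageIndex): \(pageImage.size)")
        }
        controller.dismiss(animated: true)
    }
}
```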

The Vision framework cannot run in the Simulator on Apple Silicon (M1) Macs; it can only be tested on physical devices. Running it in the Simulator throws a “Could not create Espresso context” error. I checked the official forum discussion but couldn’t find a solution.

Since I don’t have a physical iOS 18 device for testing, all execution results in this article are based on older code (pre-iOS 18). If any errors arise with the new code, please feel free to leave a comment.

WWDC 2024 — Discover Swift Enhancements in the Vision Framework

[Discover Swift enhancements in the Vision framework](https://developer.apple.com/videos/play/wwdc2024/10163/?time=45){:target="_blank"}


This article shares notes from the WWDC 24 session on Discover Swift enhancements in the Vision framework along with some personal experimental insights.

Introduction — Vision Framework Features

Face Recognition and Contour Detection

Text Recognition in Image Content

As of iOS 18, it supports 18 languages.

// List of supported languages
if #available(iOS 18.0, *) {
  print(RecognizeTextRequest().supportedRecognitionLanguages.map { "\($0.languageCode!)-\(($0.region?.identifier ?? $0.script?.identifier)!)" })
} else {
  print(try! VNRecognizeTextRequest().supportedRecognitionLanguages())
}

// The actual available recognition languages are as follows:
// The output from iOS 18 shows the following results:
// ["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant", "yue-Hans", "yue-Hant", "ko-KR", "ja-JP", "ru-RU", "uk-UA", "th-TH", "vi-VT", "ar-SA", "ars-SA"]
// I did not see the Swedish language mentioned at WWDC; it's unclear if it hasn't been released yet or if it's related to device region and language settings.

Dynamic Motion Capture

  • Enables dynamic capture of people and objects
  • Gesture recognition allows for air signature functionality (see the sketch after this list)
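
The session doesn’t show code for this, but as a rough sketch of the air-signature idea (my own assumption, using the pre-iOS 18 hand-pose API since I can’t test on iOS 18): track the index fingertip frame by frame and collect its positions as a drawn path.

```swift
import Vision
import UIKit

// Rough sketch (not from the session): collect the index fingertip position per frame
// to build up an "air signature" path, using the pre-iOS 18 hand-pose API.
var signaturePoints: [CGPoint] = []

func appendFingertipPoint(from cgImage: CGImage) throws {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 1

    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])

    guard let observation = request.results?.first else { return }
    let indexTip = try observation.recognizedPoint(.indexTip)
    guard indexTip.confidence > 0.3 else { return }

    // Vision points are normalized (0...1) with the origin at the lower-left corner,
    // so flip the y-axis before drawing in UIKit coordinates.
    signaturePoints.append(CGPoint(x: indexTip.location.x, y: 1 - indexTip.location.y))
}
```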

What’s New in Vision? (iOS 18) — Image Scoring Feature (Quality, Memorability)

  • Can calculate a score for input images, making it easier to pick out high-quality photos
  • The scoring covers multiple dimensions: not just image quality, but also lighting, angle, subject matter, and whether the shot has a memorable point

In the WWDC session, three example images (of equal technical quality) were used to illustrate the scoring:

  • High-scoring image: good composition, lighting, and memorable points
  • Low-scoring image: lacks a subject, appears to be a casual or accidental shot
  • Utility image: technically well-taken but lacks memorable points, like stock images.

iOS ≥ 18 New API: CalculateImageAestheticsScoresRequest

let request = CalculateImageAestheticsScoresRequest()
let result = try await request.perform(on: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*yL3vI1ADzwlovctW5WQgJw.jpeg")!)

// Photo score
print(result.overallScore)

// Whether it is classified as a utility image
print(result.isUtility)

What’s New in Vision? (iOS 18) — Simultaneous Detection of Body and Gesture Poses

Previously, body poses and hand poses could only be detected with separate requests. This update lets developers detect body and hand poses at the same time, combining them into a single request and result, which makes it easier to build richer features.

iOS ≥ 18 New API: DetectHumanBodyPoseRequest

var request = DetectHumanBodyPoseRequest()
// Also detect hand poses
request.detectsHands = true

guard let bodyPose = try await request.perform(on: image).first else { return }

// Body Pose Joints
let bodyJoints = bodyPose.allJoints()
// Left Hand Pose Joints
let leftHandJoints = bodyPose.leftHand.allJoints()
// Right Hand Pose Joints
let rightHandJoints = bodyPose.rightHand.allJoints()

New Vision API

In this update, Apple provides a new Swift-native Vision API. Beyond covering the original functionality, the focus is on Swift 6 / Swift Concurrency support, offering better performance and a more Swift-like API style.

Get Started with Vision

The speaker revisited the basics of the Vision framework: Apple packages common image recognition tasks into 31 request types (as of iOS 18), each with a corresponding observation result object.

  1. Request: DetectFaceRectanglesRequest for face-area detection → Result: FaceObservation. The earlier article “An Introduction to Vision — Automatic Face Cropping for App Profile Pictures (Swift)” used this pair of request and observation.
  2. Request: RecognizeTextRequest for text recognition → Result: RecognizedTextObservation
  3. Request: GenerateObjectnessBasedSaliencyImageRequest for subject (salient object) detection → Result: SaliencyImageObservation
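
For pair 1, a minimal sketch of the request → perform → observation pattern with the new API might look like this (untested here, since it requires a physical iOS 18 device; the image URL is just a placeholder):

```swift
import Vision

// Minimal sketch of the request → perform → observation pattern with the new Swift API.
if #available(iOS 18.0, *) {
    Task {
        do {
            let imageURL = URL(string: "https://example.com/face.jpg")! // Placeholder image URL
            let request = DetectFaceRectanglesRequest()
            let observations: [FaceObservation] = try await request.perform(on: imageURL)
            observations.forEach { observation in
                // Normalized bounding box of each detected face
                print("Face at: \(observation.boundingBox.cgRect)")
            }
        } catch {
            print("Request failed: \(error)")
        }
    }
}
```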

All 31 Types of Requests:

VisionRequest.

Here’s the translated text in naturalistic English while preserving the original markdown image sources:

| Request | Purpose | Observation | Description |
| --- | --- | --- | --- |
| CalculateImageAestheticsScoresRequest | Calculate the aesthetics score of an image. | AestheticsObservation | Returns the aesthetics score of the image, considering factors like composition and color. |
| ClassifyImageRequest | Classify the content of an image. | ClassificationObservation | Returns classification labels and confidence levels for objects or scenes in the image. |
| CoreMLRequest | Analyze the image using a Core ML model. | CoreMLFeatureValueObservation | Generates observations based on the output of the Core ML model. |
| DetectAnimalBodyPoseRequest | Detect the pose of animals in the image. | RecognizedPointsObservation | Returns the skeletal points of the animal and their locations. |
| DetectBarcodesRequest | Detect barcodes in the image. | BarcodeObservation | Returns barcode data and types (e.g., QR code). |
| DetectContoursRequest | Detect contours in the image. | ContoursObservation | Returns the detected contour lines in the image. |
| DetectDocumentSegmentationRequest | Detect and segment documents in the image. | RectangleObservation | Returns the rectangular boundary positions of the document. |
| DetectFaceCaptureQualityRequest | Evaluate the quality of a face capture. | FaceObservation | Returns a quality assessment score for the face image. |
| DetectFaceLandmarksRequest | Detect facial landmarks. | FaceObservation | Returns detailed positions of facial landmarks (e.g., eyes, nose). |
| DetectFaceRectanglesRequest | Detect faces in the image. | FaceObservation | Returns the bounding box positions of the faces. |
| DetectHorizonRequest | Detect the horizon in the image. | HorizonObservation | Returns the angle and position of the horizon. |
| DetectHumanBodyPose3DRequest | Detect 3D human body poses in the image. | RecognizedPointsObservation | Returns 3D skeletal points of the human body and their spatial coordinates. |
| DetectHumanBodyPoseRequest | Detect human body poses in the image. | RecognizedPointsObservation | Returns skeletal points of the human body and their coordinates. |
| DetectHumanHandPoseRequest | Detect hand poses in the image. | RecognizedPointsObservation | Returns skeletal points of the hand and their locations. |
| DetectHumanRectanglesRequest | Detect humans in the image. | HumanObservation | Returns the bounding box positions of the human figures. |
| DetectRectanglesRequest | Detect rectangles in the image. | RectangleObservation | Returns the coordinates of the four vertices of the rectangle. |
| DetectTextRectanglesRequest | Detect text areas in the image. | TextObservation | Returns the positions and bounding boxes of the text areas. |
| DetectTrajectoriesRequest | Detect and analyze the motion trajectories of objects. | TrajectoryObservation | Returns the motion trajectory points and their time series. |
| GenerateAttentionBasedSaliencyImageRequest | Generate an attention-based saliency image. | SaliencyImageObservation | Returns a saliency map highlighting the most attention-grabbing areas in the image. |
| GenerateForegroundInstanceMaskRequest | Generate a foreground instance mask image. | InstanceMaskObservation | Returns the mask of the foreground object. |
| GenerateImageFeaturePrintRequest | Generate an image feature fingerprint for comparison. | FeaturePrintObservation | Returns the feature fingerprint data of the image for similarity comparison. |
| GenerateObjectnessBasedSaliencyImageRequest | Generate an objectness-based saliency image. | SaliencyImageObservation | Returns a saliency map highlighting salient areas based on objectness. |
| GeneratePersonInstanceMaskRequest | Generate a person instance mask image. | InstanceMaskObservation | Returns the mask of the person instance. |
| GeneratePersonSegmentationRequest | Generate a person segmentation image. | SegmentationObservation | Returns a binary image of the person segmentation. |
| RecognizeAnimalsRequest | Detect and identify animals in the image. | RecognizedObjectObservation | Returns the type of animal and its confidence level. |
| RecognizeTextRequest | Detect and recognize text in the image. | RecognizedTextObservation | Returns the detected text content and its area location. |
| TrackHomographicImageRegistrationRequest | Track homographic image registration. | ImageAlignmentObservation | Returns the homographic transformation matrix between images for alignment. |
| TrackObjectRequest | Track objects in the image. | DetectedObjectObservation | Returns the position and speed information of the object in the image. |
| TrackOpticalFlowRequest | Track optical flow in the image. | OpticalFlowObservation | Returns the optical flow vector field describing pixel movement. |
| TrackRectangleRequest | Track rectangles in the image. | RectangleObservation | Returns the position, size, and rotation angle of the rectangle in the image. |
| TrackTranslationalImageRegistrationRequest | Track translational image registration. | ImageAlignmentObservation | Returns the translational transformation matrix between images for alignment. |

  • Prefixing with VN indicates the old API syntax (for versions prior to iOS 18).

The speaker mentioned several commonly used requests, as follows.

ClassifyImageRequest

Recognize the input image and obtain classification labels and confidence levels.

[Travelogue] 2024 Second Visit to Kyushu 9-Day Free Trip, Entering via Busan → Hakata Cruise


if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = ClassifyImageRequest()
    Task {
        do {
            let observations = try await request.perform(on: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*yL3vI1ADzwlovctW5WQgJw.jpeg")!)
            observations.forEach {
                observation in
                print("\(observation.identifier): \(observation.confidence)")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old syntax
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNClassificationObservation] else {
            return
        }
        observations.forEach {
            observation in
            print("\(observation.identifier): \(observation.confidence)")
        }
    }

    let request = VNClassifyImageRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*3_jdrLurFuUfNdW4BJaRww.jpeg")!, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

  outdoor: 0.75392926
  sky: 0.75392926
  blue_sky: 0.7519531
  machine: 0.6958008
  cloudy: 0.26538086
  structure: 0.15728651
  sign: 0.14224191
  fence: 0.118652344
  banner: 0.0793457
  material: 0.075975396
  plant: 0.054406323
  foliage: 0.05029297
  light: 0.048126098
  lamppost: 0.048095703
  billboards: 0.040039062
  art: 0.03977703
  branch: 0.03930664
  decoration: 0.036868922
  flag: 0.036865234
.... and more

RecognizeTextRequest

Recognize the text content in the image (a.k.a. image-to-text).

[[Travelogue] 2023 Tokyo 5-Day Free Trip](../9da2c51fa4f2/)


if #available(iOS 18.0, *) {
    // New API using Swift features
    var request = RecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = [.init(identifier: "ja-JP"), .init(identifier: "en-US")] // Specify recognition language codes, e.g., Japanese and English
    Task {
        do {
            let observations = try await request.perform(on: URL(string: "https://zhgchg.li/assets/9da2c51fa4f2/1*fBbNbDepYioQ-3-0XUkF6Q.jpeg")!)
            observations.forEach {
                observation in
                let topCandidate = observation.topCandidates(1).first
                print(topCandidate?.string ?? "No text recognized")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old syntax
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedTextObservation] else {
            return
        }
        observations.forEach {
            observation in
            let topCandidate = observation.topCandidates(1).first
            print(topCandidate?.string ?? "No text recognized")
        }
    }

    let request = VNRecognizeTextRequest(completionHandler: completionHandler)
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["ja-JP", "en-US"] // Specify language code, e.g., Traditional Chinese
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: URL(string: "https://zhgchg.li/assets/9da2c51fa4f2/1*fBbNbDepYioQ-3-0XUkF6Q.jpeg")!, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

LE LABO 青山店
TEL:03-6419-7167
*Thank you for your purchase*
No: 21347
Date: 2023/06/10 14:14:57
Person in charge:
1690370
Register: 008A 1
Product Name
Tax-inclusive Price Quantity Total Price
Kaiak 10 EDP FB 15ML
J1P7010000S
16,800
16,800
Another 13 EDP FB 15ML
J1PJ010000S
10,700
10,700
Lip Balm 15ML
JOWC010000S
2,000
1
Total Amount
(Tax included)
CARD
2,000
3 items purchased
29,500
0
29,500
29,500

DetectBarcodesRequest

Detect barcodes and QR code data in the image.

Recommended by locals in Thailand: Goose Brand Cooling Balm


let filePath = Bundle.main.path(forResource: "IMG_6777", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = DetectBarcodesRequest()
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            observations.forEach {
                observation in
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old syntax
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        observations.forEach {
            observation in
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let request = VNDetectBarcodesRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

Payload: 8859126000911
Symbology: VNBarcodeSymbologyEAN13
Payload: https://lin.ee/hGynbVM
Symbology: VNBarcodeSymbologyQR
Payload: http://www.hongthaipanich.com/
Symbology: VNBarcodeSymbologyQR
Payload: https://www.facebook.com/qr?id=100063856061714
Symbology: VNBarcodeSymbologyQR

RecognizeAnimalsRequest

Identify animals in the image along with their confidence levels.

[meme Source](https://www.redbubble.com/i/canvas-print/Funny-AI-Woman-yelling-at-a-cat-meme-design-Machine-learning-by-omolog/43039298.5Y5V7){:target="_blank"}


let filePath = Bundle.main.path(forResource: "IMG_5026", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = RecognizeAnimalsRequest()
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            observations.forEach { observation in
                let labels = observation.labels
                labels.forEach { label in
                    print("Detected animal: \(label.identifier) with confidence: \(label.confidence)")
                }
            }
        } catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old method
    let completionHandler: VNRequestCompletionHandler = { request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedObjectObservation] else {
            return
        }
        observations.forEach { observation in
            let labels = observation.labels
            labels.forEach { label in
                print("Detected animal: \(label.identifier) with confidence: \(label.confidence)")
            }
        }
    }

    let request = VNRecognizeAnimalsRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        } catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

Detected animal: Cat with confidence: 0.77245045

Others:

  • Detect humans in images: DetectHumanRectanglesRequest (see the sketch after this list)
  • Detect poses of humans and animals (both 3D and 2D): DetectAnimalBodyPoseRequest, DetectHumanBodyPose3DRequest, DetectHumanBodyPoseRequest, DetectHumanHandPoseRequest
  • Detect and track the motion trajectories of objects (in videos and animations): DetectTrajectoriesRequest, TrackObjectRequest, TrackRectangleRequest
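
For the first item, a quick sketch with the pre-iOS 18 API (matching the old-API style used elsewhere in this post); the iOS 18 version would follow the same pattern as the other new-API examples, just without the VN prefix and with async/await.

```swift
import Vision

// Quick sketch: detect people in an image and print their normalized bounding boxes (pre-iOS 18 API).
func detectHumans(in fileURL: URL) {
    let request = VNDetectHumanRectanglesRequest { request, error in
        guard error == nil,
              let observations = request.results as? [VNHumanObservation] else { return }
        observations.forEach { observation in
            // Normalized bounding box (origin at the lower-left corner)
            print("Human at: \(observation.boundingBox)")
        }
    }
    request.upperBodyOnly = false // Detect whole bodies, not just upper bodies (iOS 15+)

    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        } catch {
            print("Request failed: \(error)")
        }
    }
}
```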

iOS ≥ 18 Update Highlights:

VN*Request -> *Request (e.g. VNDetectBarcodesRequest -> DetectBarcodesRequest)
VN*Observation -> *Observation (e.g. VNRecognizedObjectObservation -> RecognizedObjectObservation)
VNRequestCompletionHandler -> async/await
VNImageRequestHandler.perform([VN*Request]) -> *Request.perform()

WWDC Example

The official WWDC video uses a supermarket product scanner as an example.

Most products have barcodes that can be scanned.

We can obtain the barcode’s location from observation.boundingBox; however, unlike the usual UIView coordinate system, the bounding box is normalized (values from 0 to 1) with its origin at the bottom-left corner.

let filePath = Bundle.main.path(forResource: "IMG_6785", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    var request = DetectBarcodesRequest()
    request.symbologies = [.ean13] // Specify to scan only EAN13 Barcode for better performance
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            if let observation = observations.first {
                DispatchQueue.main.async {
                    self.infoLabel.text = observation.payloadString
                    // Color layer for marking
                    let colorLayer = CALayer()
                    // iOS >=18 new coordinate conversion API toImageCoordinates
                    // Not tested; actual calculations may require adjustments for ContentMode = AspectFit:
                    colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                    colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                    self.baseImageView.layer.addSublayer(colorLayer)
                }
                print("BoundingBox: \(observation.boundingBox.cgRect)")
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
        } catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old method
    let completionHandler: VNRequestCompletionHandler = { request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        if let observation = observations.first {
            DispatchQueue.main.async {
                self.infoLabel.text = observation.payloadStringValue
                // Color layer for marking
                let colorLayer = CALayer()
                colorLayer.frame = self.convertBoundingBox(observation.boundingBox, to: self.baseImageView)
                colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                self.baseImageView.layer.addSublayer(colorLayer)
            }
            print("BoundingBox: \(observation.boundingBox)")
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let request = VNDetectBarcodesRequest(completionHandler: completionHandler)
    request.symbologies = [.ean13] // Specify to scan only EAN13 Barcode for better performance
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        } catch {
            print("Request failed: \(error)")
        }
    }
}

iOS ≥ 18 Update Highlights:

// iOS >=18 new coordinate conversion API toImageCoordinates
observation.boundingBox.toImageCoordinates(CGSize, origin: .upperLeft)
// https://developer.apple.com/documentation/vision/normalizedpoint/toimagecoordinates(from:imagesize:origin:)

Helper:

// Generated by ChatGPT 4o
// Since the photo in the ImageView is set with ContentMode = AspectFit
// We need to calculate the vertical displacement caused by the fit
func convertBoundingBox(_ boundingBox: CGRect, to view: UIImageView) -> CGRect {
    guard let image = view.image else {
        return .zero
    }

    let imageSize = image.size
    let viewSize = view.bounds.size
    let imageRatio = imageSize.width / imageSize.height
    let viewRatio = viewSize.width / viewSize.height
    var scaleFactor: CGFloat
    var offsetX: CGFloat = 0
    var offsetY: CGFloat = 0
    if imageRatio > viewRatio {
        // Image fits in width
        scaleFactor = viewSize.width / imageSize.width
        offsetY = (viewSize.height - imageSize.height * scaleFactor) / 2
    } else {
        // Image fits in height
        scaleFactor = viewSize.height / imageSize.height
        offsetX = (viewSize.width - imageSize.width * scaleFactor) / 2
    }

    let x = boundingBox.minX * imageSize.width * scaleFactor + offsetX
    let y = (1 - boundingBox.maxY) * imageSize.height * scaleFactor + offsetY
    let width = boundingBox.width * imageSize.width * scaleFactor
    let height = boundingBox.height * imageSize.height * scaleFactor
    return CGRect(x: x, y: y, width: width, height: height)
}

Output Results

BoundingBox: (0.5295758928571429, 0.21408638121589782, 0.0943080357142857, 0.21254415360708087)
Payload: 4710018183805
Symbology: VNBarcodeSymbologyEAN13

Some products do not have barcodes, such as bulk fruits that only have product labels.

Therefore, our scanner also needs to support scanning plain text labels.

let filePath = Bundle.main.path(forResource: "apple", ofType: "jpg")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    var barcodesRequest = DetectBarcodesRequest()
    barcodesRequest.symbologies = [.ean13] // Specify to scan only EAN13 Barcode for better performance
    var textRequest = RecognizeTextRequest()
    textRequest.recognitionLanguages = [.init(identifier: "zh-Hant"), .init(identifier: "en-US")]
    Task {
        do {
            let handler = ImageRequestHandler(fileURL)
            // Parameter pack syntax; we must wait for all requests to finish before using their results.
            // let (barcodesObservation, textObservation, ...) = try await handler.perform(barcodesRequest, textRequest, ...)
            let (barcodesObservation, textObservation) = try await handler.perform(barcodesRequest, textRequest)
            if let observation = barcodesObservation.first {
                DispatchQueue.main.async {
                    self.infoLabel.text = observation.payloadString
                    // Color layer for marking
                    let colorLayer = CALayer()
                    // iOS >=18 new coordinate conversion API toImageCoordinates
                    // Not tested; actual calculations may require adjustments for ContentMode = AspectFit:
                    colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                    colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                    self.baseImageView.layer.addSublayer(colorLayer)
                }
                print("BoundingBox: \(observation.boundingBox.cgRect)")
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
            textObservation.forEach { observation in
                let topCandidate = observation.topCandidates(1).first
                print(topCandidate?.string ?? "No text recognized")
            }
        } catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old method
    let barcodesCompletionHandler: VNRequestCompletionHandler = { request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        if let observation = observations.first {
            DispatchQueue.main.async {
                self.infoLabel.text = observation.payloadStringValue
                // Color layer for marking
                let colorLayer = CALayer()
                colorLayer.frame = self.convertBoundingBox(observation.boundingBox, to: self.baseImageView)
                colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                self.baseImageView.layer.addSublayer(colorLayer)
            }
            print("BoundingBox: \(observation.boundingBox)")
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let textCompletionHandler: VNRequestCompletionHandler = { request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedTextObservation] else {
            return
        }
        observations.forEach { observation in
            let topCandidate = observation.topCandidates(1).first
            print(topCandidate?.string ?? "No text recognized")
        }
    }

    let barcodesRequest = VNDetectBarcodesRequest(completionHandler: barcodesCompletionHandler)
    barcodesRequest.symbologies = [.ean13] // Specify to scan only EAN13 Barcode for better performance
    let textRequest = VNRecognizeTextRequest(completionHandler: textCompletionHandler)
    textRequest.recognitionLevel = .accurate
    textRequest.recognitionLanguages = ["en-US"]
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([barcodesRequest, textRequest])
        } catch {
            print("Request failed: \(error)")
        }
    }
}

Output Results:

94128s
ORGANIC
Pink Lady®
Produce of USh

iOS ≥ 18 Update Highlights:

let handler = ImageRequestHandler(fileURL)
// Parameter pack syntax; we must wait for all requests to finish before using their results.
// let (barcodesObservation, textObservation, ...) = try await handler.perform(barcodesRequest, textRequest, ...)
let (barcodesObservation, textObservation) = try await handler.perform(barcodesRequest, textRequest)

iOS ≥ 18 performAll() Method

The previous perform(barcodesRequest, textRequest) method requires waiting for both requests to complete before proceeding; starting with iOS 18, a new performAll() method is provided, allowing for streaming responses. You can handle results as soon as one of the requests is completed, such as responding immediately upon scanning a barcode.

if #available(iOS 18.0, *) {
    // New API using Swift features
    var barcodesRequest = DetectBarcodesRequest()
    barcodesRequest.symbologies = [.ean13] // Specify to scan only EAN13 Barcode for better performance
    var textRequest = RecognizeTextRequest()
    textRequest.recognitionLanguages = [.init(identifier: "zh-Hant"), .init(identifier: "en-US")]
    Task {
        let handler = ImageRequestHandler(fileURL)
        let observation = handler.performAll([barcodesRequest, textRequest] as [any VisionRequest])
        for try await result in observation {
            switch result {
                case .detectBarcodes(_, let barcodesObservation):
                    if let observation = barcodesObservation.first {
                        DispatchQueue.main.async {
                            self.infoLabel.text = observation.payloadString
                            // Color layer for marking
                            let colorLayer = CALayer()
                            // iOS >=18 new coordinate conversion API toImageCoordinates
                            // Not tested; actual calculations may require adjustments for ContentMode = AspectFit:
                            colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                            colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                            self.baseImageView.layer.addSublayer(colorLayer)
                        }
                        print("BoundingBox: \(observation.boundingBox.cgRect)")
                        print("Payload: \(observation.payloadString ?? "No payload")")
                        print("Symbology: \(observation.symbology)")
                    }
                case .recognizeText(_, let textObservation):
                    textObservation.forEach { observation in
                        let topCandidate = observation.topCandidates(1).first
                        print(topCandidate?.string ?? "No text recognized")
                    }
                default:
                    print("Unrecognized result: \(result)")
            }
        }
    }
}

Optimize with Swift Concurrency

Assuming we have a list of image thumbnails, where each image needs to be automatically cropped to focus on the main subject, we can effectively utilize Swift Concurrency to enhance loading efficiency.

Original Implementation

func generateThumbnail(url: URL) async throws -> UIImage {
  let request = GenerateAttentionBasedSaliencyImageRequest()
  let saliencyObservation = try await request.perform(on: url)
  return cropImage(url, to: saliencyObservation.salientObjects)
}
    
func generateAllThumbnails() async throws {
  for image in images {
    image.thumbnail = try await generateThumbnail(url: image.url)
  }
}

This approach processes one image at a time, which is slow and inefficient.

Optimization (1) — TaskGroup Concurrency

func generateAllThumbnails() async throws {
  try await withThrowingDiscardingTaskGroup { taskGroup in
    for image in images {
      taskGroup.addTask {
        image.thumbnail = try await generateThumbnail(url: image.url)
      }
    }
  }
}

Here, each task is added to a TaskGroup for concurrent execution.

Note: Image recognition and cropping operations are very memory-intensive. If too many concurrent tasks are initiated without restraint, it may lead to user lag or out-of-memory (OOM) crashes.

Optimization (2) — TaskGroup Concurrency + Limiting Concurrent Tasks

func generateAllThumbnails() async throws {
    try await withThrowingDiscardingTaskGroup { taskGroup in
        // Limit the maximum number of concurrent tasks to 5
        let maxImageTasks = min(5, images.count)
        // Initially fill 5 tasks
        for index in 0..<maxImageTasks {
            taskGroup.addTask {
                images[index].thumbnail = try await generateThumbnail(url: images[index].url)
            }
        }
        var nextIndex = maxImageTasks
        for try await _ in taskGroup {
            // When a task in the taskGroup completes...
            // Check whether there are still images left to process
            if nextIndex < images.count {
                let image = images[nextIndex]
                // Continue adding tasks (keeping the limit at 5)
                taskGroup.addTask {
                    image.thumbnail = try await generateThumbnail(url: image.url)
                }
                nextIndex += 1
            }
        }
    }
}

Update an Existing Vision App

  1. Vision will remove CPU and GPU support for certain requests on devices equipped with a Neural Engine. On these devices, the Neural Engine is the optimal choice for performance. You can check this using the supportedComputeDevices() API.
  2. Remove the VN prefix: VNXXXRequest → XXXRequest, VNXXXObservation → XXXObservation.
  3. Use async/await instead of the original VNRequestCompletionHandler.
  4. Directly use *Request.perform() instead of the original VNImageRequestHandler.perform([VN*Request]).

Wrap-Up

  • APIs designed with new Swift language features.
  • New functionalities and methods are Swift-only, available for iOS ≥ 18.
  • New image scoring features, body and hand motion tracking.

Thanks!

KKday Recruitment

👉👉👉 This sharing session originates from the weekly technical sharing activities of the KKday App Team. The team is currently actively recruiting Senior iOS Engineers. Interested candidates are welcome to submit their resumes. 👈👈👈

References

Discover Swift enhancements in the Vision framework

The Vision Framework API has been redesigned to leverage modern Swift features like concurrency, making it easier and faster to integrate a wide array of Vision algorithms into your app. We’ll tour the updated API and share sample code, along with best practices, to help you get the benefits of this framework with less coding effort. We’ll also demonstrate two new features: image aesthetics and holistic body pose.


Vision framework Apple Developer Documentation

If you have any questions or feedback, feel free to contact me.


This article was first published on Medium ➡️ Click Here

This post is licensed under CC BY 4.0 by the author.