1. Information
- Submission Window: From 11:59 PM Pacific Time on March 1, 2025, to 11:59 PM Pacific Time on April 1, 2025
- Sponsor: Qualcomm Technologies, Inc.
- Competition: 2025 Low-Power Computer Vision Challenge (2025 LPCVC)
- Vision Task: Monocular Relative Depth Estimation
- Hardware: The submitted model will be evaluated on Qualcomm AI Hub on the target device (Snapdragon 8 Elite QRD; see Evaluation Details below)
- Software: Qualcomm AI Hub
- Technical Support: Subscribe to the newsletter or join the Qualcomm AI Hub Slack workspace. Make sure to join the #lpcvc channel for competition-related notifications.
- Evaluation: see Evaluation Details below
- Prizes:
- Champion: $1,500 + a laptop with Snapdragon X Elite processors (Snapdragon X Elite Laptop)
- 2nd: $1,000
- 3rd: $500
- $200 for each of the first 5 teams with a valid submission (better than the sample solution)

2. Prerequisites
Two Registrations:
- Each team is required to register a team account and sign the agreement document (each team only needs to register once). We will use this registration information to manage teams and their submissions.
- Sign up for an account on Qualcomm® AI Hub (top right corner). Every team member may register an account if they wish. Qualcomm® AI Hub is a powerful tool for compiling, profiling, and running inference on real mobile devices.
Please refer to our Track 3 sample solution for more details on participation.
3. Submission Format
Model Format: Participants can train their models using various libraries, such as PyTorch, ONNX, TensorFlow, and AI Model Efficiency Toolkit (AIMET) quantized models. Qualcomm AI Hub supports models from these libraries and can compile them directly for mobile devices. Once the model is compiled with Qualcomm AI Hub, please follow the next two steps to submit it for evaluation and ranking.
There are two steps to complete the submission.
Step 1: On Qualcomm AI Hub, share access to the model you want to submit with lowpowervision@gmail.com. This ensures that our evaluation server can access submitted models from Qualcomm AI Hub.
Step 2: Fill out the submission form.
Please refer to our Track 3 sample solution for more details on submission.
Please note: your model will not be evaluated and ranked unless both steps (Step 1 & Step 2) are completed. Each model requires its own submission form because the Compile Job ID must be specified in the form.
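For orientation, the snippet below sketches what compiling a model for the target device looks like with the Qualcomm AI Hub Python client (`qai_hub`). The stand-in network, the device-name string, and the exact client calls are illustrative assumptions; follow the Track 3 sample solution and the AI Hub documentation for the authoritative workflow.

```python
import torch
import torch.nn as nn
import qai_hub as hub

# Stand-in network (replace with your trained depth model); it maps (B, 3, 480, 640) -> (B, 1, 480, 640).
model = nn.Sequential(nn.Conv2d(3, 1, kernel_size=3, padding=1), nn.Sigmoid()).eval()

# Trace the model so Qualcomm AI Hub can compile it for the target device.
example_input = torch.rand(1, 3, 480, 640)
traced_model = torch.jit.trace(model, example_input)

# Submit a compile job targeting the evaluation device (device name per the challenge rules).
compile_job = hub.submit_compile_job(
    model=traced_model,
    device=hub.Device("Snapdragon 8 Elite QRD"),
    input_specs=dict(image=(1, 3, 480, 640)),
)
print("Compile Job ID:", compile_job.job_id)  # this ID goes into the submission form

# Step 1 of the submission: share the compiled model with lowpowervision@gmail.com,
# e.g. through the Qualcomm AI Hub web interface.
```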
4. Evaluation Details
4.1 Data
The evaluation dataset comprises 2,000 RGB images captured in various indoor and outdoor scenes under different lighting conditions (normal and low light) using a range of mobile devices. As detailed in the table below, the dataset includes 500 indoor images with normal light, 500 indoor images with low light, 500 outdoor images with normal light, and 500 outdoor images with low light. 10% of these evaluation RGB images will be publicly accessible.
| Scene \ Lighting | Normal Light | Low Light  |
|------------------|--------------|------------|
| Indoor           | 500 images   | 500 images |
| Outdoor          | 500 images   | 500 images |
Each image is accompanied by a corresponding Depth Map and Confidence Map. The Depth Map indicates the depth of each pixel in the RGB image, while the Confidence Map shows the confidence level of the corresponding pixel in the Depth Map. The following picture displays some samples (indoor normal light, indoor low light, outdoor low light). Each row contains the RGB image, Visualized Depth Map, and Visualized Confidence Map from left to right.



4.2 Model Input and Output
- Model Input: During model evaluation, only the RGB images will be fed into the submitted model. The Depth Map and Confidence Map will not be accessible. Therefore, the submitted model should take only one image as input. All RGB images are in VGA resolution (640x480), so the input tensor shape will be (Batch, Channels, Height, Width) = (Batch, 3, 480, 640) in the PyTorch format. All images will be loaded in RGB channel order with float values in the range [0, 255]. No input normalization will be applied, so each submitted model should include any normalization operations, such as (image-mean)/std or image/255, at the beginning of the model if needed (a minimal wrapper sketch is shown after this list).
- Model Output: The submitted model is expected to predict a relative depth (ranging from 0 to 1, from near to far) based on its input. The model should produce a single output with one channel, and the expected output tensor shape is (Batch, 1, Height, Width) = (Batch, 1, 480, 640).
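Because no normalization is applied by the evaluation pipeline, one way to satisfy these input/output constraints is to wrap the network so that preprocessing and the [0, 1] output range live inside the submitted model. The sketch below is illustrative; the wrapper name, the normalization statistics, and the final sigmoid are assumptions, not requirements.

```python
import torch
import torch.nn as nn

class DepthModelWithPreprocess(nn.Module):
    """Wraps a depth network so that input normalization happens inside the submitted model."""

    def __init__(self, net: nn.Module):
        super().__init__()
        self.net = net
        # Placeholder ImageNet-style statistics on the [0, 255] scale; use your own training statistics.
        self.register_buffer("mean", torch.tensor([123.675, 116.28, 103.53]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([58.395, 57.12, 57.375]).view(1, 3, 1, 1))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, 480, 640), RGB channel order, float values in [0, 255].
        x = (image - self.mean) / self.std
        depth = self.net(x)              # expected shape: (B, 1, 480, 640)
        return torch.sigmoid(depth)      # one way to constrain the output to [0, 1] (near -> far)
```

A model wrapped this way (e.g. `DepthModelWithPreprocess(my_net).eval()`) can then be traced and compiled as in the submission sketch above.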
4.3 Metrics
Given that this challenge focuses on low-power computer vision tasks, we propose a two-stage evaluation strategy:
- Stage 1: All submissions will first be profiled on real mobile devices (Snapdragon 8 Elite QRD). A submission is considered valid if its inference time on the target mobile device is under 30 ms per image (i.e., above roughly 33 fps). A sketch of how to profile a compiled model yourself is shown after this list.
- Stage 2: For valid submissions, we will calculate accuracy scores and rank them by a single score or a weighted combination of several scores. Specifically, a valid submission will run inference on all 2,000 evaluation images on the target mobile device. Following existing work [1], [2], [3], [4], we will compute image-based metrics (MAE, RMSE, AbsRel, etc.) and point-cloud-based metrics (F-Score, IoU, Chamfer distance, etc.). Invalid submissions receive neither accuracy scores nor a ranking.
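To check the Stage 1 latency constraint before submitting, a compiled model can be profiled on the target device with a profile job. The sketch below follows the common `qai_hub` pattern; the exact report fields and device name should be taken from the AI Hub documentation.

```python
import qai_hub as hub

# `compile_job` is the compile job created when the model was compiled (see the submission sketch above).
target_model = compile_job.get_target_model()

# Profile the compiled model on the target evaluation device.
profile_job = hub.submit_profile_job(
    model=target_model,
    device=hub.Device("Snapdragon 8 Elite QRD"),
)

# The downloaded profile report contains the measured on-device inference time;
# Stage 1 requires it to be below 30 ms per image.
profile = profile_job.download_profile()
```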
Note:
- Unlike other challenges or benchmarks, we will only consider predicted depth at locations with high confidence scores in the Confidence Map. For example, in the visualized samples above, predicted depth in the black areas of the Confidence Map (indicating low confidence) will not be considered when calculating accuracy scores. This approach aims to minimize the negative impact of errors in the ground-truth depth on the evaluation scores.
- The predicted depth map (ranging from 0 to 1) will be rescaled using the median value of the ground-truth depth before accuracy scores are calculated (an illustrative sketch combining these two notes follows below).
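To make these two notes concrete, the sketch below computes a confidence-masked, median-scaled error for a single image. The confidence threshold, the specific metrics, and the exact rescaling scheme are illustrative assumptions rather than the official scoring code.

```python
import numpy as np

def masked_scaled_errors(pred, gt_depth, confidence, conf_thresh=0.5):
    """Confidence-masked MAE and AbsRel after median rescaling of the prediction.

    pred:       (H, W) predicted relative depth in [0, 1]
    gt_depth:   (H, W) ground-truth depth
    confidence: (H, W) confidence of each ground-truth depth value
    """
    mask = confidence > conf_thresh                   # keep only high-confidence ground truth
    pred_m, gt_m = pred[mask], gt_depth[mask]

    # Align the relative prediction with the ground-truth scale via the medians.
    scale = np.median(gt_m) / max(np.median(pred_m), 1e-6)
    pred_m = pred_m * scale

    mae = np.abs(pred_m - gt_m).mean()
    abs_rel = (np.abs(pred_m - gt_m) / np.clip(gt_m, 1e-6, None)).mean()
    return mae, abs_rel
```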
4.4 Leaderboard
Once the submission form is submitted, the model (specified by its Compile Job ID) will be evaluated, and the ranking result will be available on our leaderboard.
References
[1] Jaime Spencer, C. Stella Qian, Chris Russell, Simon Hadfield, Erich Graf, Wendy Adams, Andrew J. Schofield, et al. "The monocular depth estimation challenge." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 623-632, 2023.
[2] Filippo Aleotti, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia. "Generative adversarial networks for unsupervised monocular depth prediction." In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[3] Jaime Spencer, Fabio Tosi, Matteo Poggi, Ripudaman Singh Arora, Chris Russell, Simon Hadfield, Richard Bowden, et al. "The third monocular depth estimation challenge." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1-14, 2024.
[4] Evin Pınar Örnek, Shristi Mudgal, Johanna Wald, Yida Wang, Nassir Navab, and Federico Tombari. "From 2D to 3D: Re-thinking benchmarking of monocular depth prediction." arXiv preprint arXiv:2203.08122, 2022.