1. Information
____________________________________________________________________________________
2. Prerequisites
Two Registrations:
____________________________________________________________________________________
3. Submission Format
Model Format: Participants can train their models with various libraries, such as PyTorch, TensorFlow, or the AI Model Efficiency Toolkit (AIMET) for quantized models, or export them to ONNX. Qualcomm AI Hub supports models from these libraries and can compile them directly for mobile devices. Once the model is compiled on Qualcomm AI Hub (as a compile_job), please follow the two steps below to submit it for evaluation and ranking.
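For reference, the sketch below shows one way to submit a compile job through the Qualcomm AI Hub Python client (qai-hub). The dummy model, device name, and input shapes are placeholders for illustration only, not the official Track 1 settings; please follow the AI Hub documentation and the sample solution for the exact configuration.

# Minimal compile-job sketch using the Qualcomm AI Hub Python client (qai-hub).
# The dummy model, device name, and input shapes are placeholders, not official Track 1 settings.
import torch
import qai_hub as hub

class DummyRetrievalModel(torch.nn.Module):
    """Placeholder two-tower model; replace with your actual solution."""
    def forward(self, image_input, text_input):
        image_emb = image_input.mean(dim=(2, 3))                     # stand-in image embedding
        text_emb = text_input[0].float().mean(dim=1, keepdim=True)   # stand-in text embedding
        return image_emb, text_emb

model = DummyRetrievalModel().eval()
example_image = torch.rand(1, 3, 1024, 1024)
example_text = torch.zeros(2, 1, 77, dtype=torch.int32)

# Trace the model so AI Hub can compile it for an on-device target.
traced_model = torch.jit.trace(model, (example_image, example_text))

compile_job = hub.submit_compile_job(
    model=traced_model,
    device=hub.Device("Samsung Galaxy S24 (Family)"),   # placeholder device
    input_specs=dict(
        image_input=((1, 3, 1024, 1024), "float32"),
        text_input=((2, 1, 77), "int32"),
    ),
)
print(compile_job)  # note the compile job ID for the submission form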
There are two steps to complete the submission.
Step 1: On Qualcomm AI Hub, share access to the model you want to submit with lowpowervision@gmail.com. This ensures that our evaluation server can pull submitted models from Qualcomm AI Hub.
Step 2: Fill out the submission form.
* Please refer to [Track 1 Sample Solution] for more details on the submission process. (TO BE UPDATED)
Share compile job:
# IMPORTANT! You must share your compile job with the LPCVC organizers so that we can pull and evaluate it.
compile_job.modify_sharing(add_emails=['lowpowervision@gmail.com'])
Please note: Your models will not be evaluated or ranked unless you complete both steps (Step 1 and Step 2). Each model requires its own submission form, because the Compile Job ID must be specified in the form (see the snippet below).
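For convenience, the snippet below shows one way to read the Compile Job ID from the Python client. The get_job call and job_id attribute are our assumptions based on the qai-hub client; the ID is also visible in the job URL on the AI Hub dashboard.

# Looking up the Compile Job ID needed for the submission form (assumed qai-hub client calls).
import qai_hub as hub

compile_job = hub.get_job("jabc123xy")   # hypothetical job ID placeholder; use your own job
print(compile_job.job_id)                # paste this ID into the submission form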
____________________________________________________________________________________
4. Evaluation Details
4.1 Data
Training data: We do not restrict the training data for this competition; participants are free to use any accessible datasets.
Test data: The test dataset comprises approximately 300 RGB images of arbitrary objects. The images are randomly selected from various public sources, and all annotations are newly created by the LPCVC team (no existing public annotations are reused). Each image is accompanied by several text descriptions referring to specific targets or to the whole image.
Sample data: We offer ~50 sample images with textual annotations as a starting point for participants to explore and validate their solutions. Please note that the sample data are not included in the final test-stage dataset. The entire sample dataset will be accessible on Google Drive [GoogleDrive (TO BE UPDATED)].
4.2 Task
Goal: Retrieve the most relevant text descriptions for a given image from a candidate pool.
Model Input: An RGB image and all textual descriptions in the dataset.
Model Output: Image and text embeddings (used to compute similarity scores and then rank the candidates; see the sketch after this list).
* Check the provided [Sample Solution (TO BE UPDATED)] for the detailed input and output data formats used by the evaluation pipeline.
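To make the retrieval step concrete, here is a minimal sketch of turning image and text embeddings into a ranking via cosine similarity. The embedding dimension and candidate count are made up for illustration; the official scoring is defined by the evaluation pipeline in the sample solution.

# Illustrative ranking from embeddings via cosine similarity (not the official scoring script).
import torch

image_emb = torch.randn(1, 512)       # hypothetical image embedding produced by the model
text_embs = torch.randn(300, 512)     # hypothetical embeddings of all candidate texts

# L2-normalize so the dot product equals cosine similarity.
image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
text_embs = torch.nn.functional.normalize(text_embs, dim=-1)

similarity = image_emb @ text_embs.T  # shape: (1, 300)
topk = similarity.topk(k=5, dim=-1)   # indices of the 5 most relevant candidate texts
print(topk.indices)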
4.3 Metrics
The evaluation is conducted in two stages:
4.4 Sample Solution
We provide [OpenAI-CLIP] as a sample solution to better support participants. Its latency (inference time) on the test data will be used as the reference for determining whether submitted solutions are valid.
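For orientation, the sketch below obtains image and text embeddings from the openai/clip-vit-base-patch32 checkpoint with the Hugging Face transformers API. It is not the official sample solution; the released sample will define the exact preprocessing, tokenization, and export code.

# Sketch of producing CLIP image/text embeddings with Hugging Face transformers
# (for orientation only; the official sample solution may differ).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")      # hypothetical local image
texts = ["a dog.", "a red car.", "a person riding a bike."]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )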
4.5 Data Format in Evaluation: All test data will be used to evaluate the submitted solutions online via AIHub. We therefore prepare the test data in a specific format that fits the requirements of the AIHub platform and the QNN libraries. (*mainly follows the OpenAI-CLIP setting) (TO BE UPDATED)
Each test sample provides two model inputs, Image and Text, prepared as follows.

Data Format
- Image: image_input: torch tensor float, shape=1x3x1024x1024 # resized and zero-padded RGB image (see Explanation)
- Text: text_input: torch tensor int, shape=2x1x77 # CLIP tokenizer output, text_input = [input_ids; attention_mask]

Explanation
- Image: Following the preprocessing operations of XDecoder, we (1) first resize the longest edge of the original image to 1024, (2) then pad it to a 1024x1024 square with zeros, and (3) apply no normalization, so please remember to add normalization and any other pre-processing inside your model.
- Text: Following the XDecoder preprocessing, and to reduce the influence of using different tokenizers for a fair comparison, we fix the tokenizer to the `openai-clip` text tokenizer. Check the sample code below for more details.
  * Quoted from transformers.CLIPTokenizer: input_ids (torch.LongTensor of shape (batch_size, sequence_length)): indices of input sequence tokens in the vocabulary; padding will be ignored by default should you provide it. attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Sample data preparation code
Image:
# Image preprocessing (XDecoder-style): resize so the longest edge is at most 1024, then zero-pad to 1024x1024.
# ImageList is assumed here to be the detectron2-style utility used by XDecoder.
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
from detectron2.structures import ImageList

img = Image.open(image_path).convert("RGB")   # image_path: path to an input image
transform = transforms.Compose([
    transforms.Resize(1000, max_size=1024),
])
image = transform(img)
image = torch.from_numpy(np.asanyarray(image)).float().permute(2, 0, 1)
images = [image]
image_input = ImageList.from_tensors(images, size_divisibility=1024).tensor   # shape: 1x3x1024x1024

Text:
# Fixed text tokenizer (openai-clip)
import torch
from transformers import CLIPTokenizer

pretrained_tokenizer = 'openai/clip-vit-base-patch32'
tokenizer = CLIPTokenizer.from_pretrained(pretrained_tokenizer)
tokenizer.add_special_tokens({'cls_token': tokenizer.eos_token})

# Example tokenized text input to the model; text is the input description string, e.g. "dog."
tokens = tokenizer(text, padding='max_length', truncation=True, max_length=77, return_tensors='pt')
text_input = torch.stack((tokens['input_ids'], tokens['attention_mask']))   # shape: 2x1x77
Example data
Input text: "dog." (*note: all input texts end with a period "." following the provided sample solution)
Tokenized output = {
  'input_ids': tensor([[49406,  2308,  1774, 15762,   269, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407]], device='cuda:0', dtype=torch.int32),
  'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
}
____________________________________________________________________________________
5. Compile, Profile, Inference via AIHub
Please refer to the provided [Sample Solution] for details on model compilation, profiling, and inference via AIHub.
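As a rough orientation, the sketch below chains profile and inference jobs after a compile job with the qai-hub Python client. The job ID, device name, and input names are placeholders; the sample solution is the authoritative reference for the exact workflow.

# Sketch of profiling and running on-device inference on the compiled model via qai-hub.
# Job ID, device name, and input names are placeholders; follow the sample solution for official settings.
import numpy as np
import qai_hub as hub

compile_job = hub.get_job("jabc123xy")               # hypothetical ID of your shared compile job
device = hub.Device("Samsung Galaxy S24 (Family)")   # placeholder device
target_model = compile_job.get_target_model()        # compiled model produced by the compile job

# Measure on-device latency.
profile_job = hub.submit_profile_job(model=target_model, device=device)

# Run on-device inference with one preprocessed sample.
inference_job = hub.submit_inference_job(
    model=target_model,
    device=device,
    inputs=dict(
        image_input=[np.random.rand(1, 3, 1024, 1024).astype(np.float32)],
        text_input=[np.zeros((2, 1, 77), dtype=np.int32)],
    ),
)
outputs = inference_job.download_output_data()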
Important! After the submission window closes, the TOP-5 teams on the leaderboard will be contacted to confirm which model is their final solution for evaluation on the whole test data. In addition to the QNN model shared via AIHub, the converted ONNX model and detailed evaluation scripts will also be requested (see the export sketch below).
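If your final solution is a PyTorch model, one common way to produce the requested ONNX file is torch.onnx.export, sketched below with an assumed placeholder model, input names, shapes, and filename; adapt it to your actual model signature.

# Sketch of exporting a PyTorch model to ONNX for the final hand-off
# (model, input names, shapes, and filename are assumptions; adapt to your solution).
import torch

class DummyModel(torch.nn.Module):
    """Placeholder; replace with your trained retrieval model."""
    def forward(self, image_input, text_input):
        return image_input.mean(dim=(2, 3)), text_input[0].float().mean(dim=1, keepdim=True)

model = DummyModel().eval()
example_image = torch.rand(1, 3, 1024, 1024)
example_text = torch.zeros(2, 1, 77, dtype=torch.int32)

torch.onnx.export(
    model,
    (example_image, example_text),
    "final_solution.onnx",                               # hypothetical output filename
    input_names=["image_input", "text_input"],
    output_names=["image_embedding", "text_embedding"],
    opset_version=17,
)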
____________________________________________________________________________________
6. References