1. Information
____________________________________________________________________________________
2. Prerequisites
Two Registrations:
____________________________________________________________________________________
3. Submission Format
Model Format: Participants may train their models with various libraries such as PyTorch, TensorFlow, ONNX, and AI Model Efficiency Toolkit (AIMET) quantized models. Qualcomm AI Hub supports models trained with these libraries and can compile them directly for mobile devices. Once your model is compiled with Qualcomm AI Hub, follow the two steps below to submit it (the compile job on AIHub) for evaluation and ranking.
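For orientation, here is a minimal compile sketch using the Qualcomm AI Hub Python client (qai_hub), assuming `model` is your own torch.nn.Module with the interface described in Section 4; the device name and the dtype syntax of `input_specs` are assumptions, so please verify them against the [Track 2 Sample Solution] and the AIHub documentation.

    import torch
    import qai_hub as hub

    # Assumption: `model` is your torch.nn.Module taking (input_image, text_input)
    # with the shapes described in Section 4.5.
    example_image = torch.rand(1, 3, 1024, 1024)
    example_text = torch.zeros(2, 1, 77, dtype=torch.int32)
    traced_model = torch.jit.trace(model, (example_image, example_text))

    # The device name below is an assumption; use the target device given in the sample solution.
    compile_job = hub.submit_compile_job(
        model=traced_model,
        device=hub.Device("Snapdragon 8 Elite QRD"),
        input_specs=dict(
            input_image=((1, 3, 1024, 1024), "float32"),
            text_input=((2, 1, 77), "int32"),
        ),
    )
    # After compiling, remember to share the compile job with the organizers (Step 1 below).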
There are two steps to complete the submission.
Step 1: On Qualcomm AI Hub, share access to the model you want to submit with lowpowervision@gmail.com. This ensures that our evaluation server can access submitted models from Qualcomm AI Hub.
Step 2: Fill out the submission form.
* Please refer to the [Track 2 Sample Solution] for more details on the submission.
Share compile job:

    # IMPORTANT! You must share your compile job with the LPCVC organizers so that we can pull and evaluate it.
    compile_job.modify_sharing(add_emails=['lowpowervision@gmail.com'])
Please note: your models will not be evaluated or ranked unless you complete both steps (Step 1 and Step 2). Each model requires its own submission form, because the Compile Job ID must be specified in the form.
____________________________________________________________________________________
4. Evaluation Details
4.1 Data
Training data: We do not restrict the training data for this competition; participants are free to use any accessible datasets.
Test data: The test dataset comprises approximately 1,000 RGB images from around 200 classes, randomly selected from various public sources (all data are newly annotated by the LPCVC 2025 team; no existing public annotations are involved). Each image is accompanied by several text descriptions indicating different targets.
Sample data: We provide ~50 sample images with annotations for participants to explore and validate their solutions as a starting point. Please note that the sample data are not included in the final test-stage dataset. The entire sample dataset is accessible on Google Drive [https://drive.google.com/file/d/1OWw5jiaIkbnfe8FbgLTynjKiCj00TLJh/view?usp=sharing].
4.2 Task
Model Input: An RGB image along with a text description.
Model Output: A binary mask prediction that matches the input image size (1024x1024).
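To make this I/O contract concrete, below is a minimal, hypothetical wrapper sketch; the 1x1 convolution is only a placeholder for a participant's actual image-plus-text segmentation network, and all names are illustrative.

    import torch
    import torch.nn as nn

    class TrackTwoWrapper(nn.Module):
        # Hypothetical sketch matching the Track 2 I/O contract; the conv head
        # stands in for a real referring-segmentation network.
        def __init__(self):
            super().__init__()
            self.head = nn.Conv2d(3, 1, kernel_size=1)  # placeholder network

        def forward(self, input_image: torch.Tensor, text_input: torch.Tensor) -> torch.Tensor:
            # input_image: 1x3x1024x1024 float RGB (unnormalized; see Section 4.5)
            # text_input:  2x1x77 int, stacked [input_ids; attention_mask]
            logits = self.head(input_image)  # 1x1x1024x1024 (text ignored by this stub)
            pred_mask = (logits >= 0).float().reshape(1024, 1024)  # binary values, float dtype (QNN-friendly)
            return pred_mask

    # Usage sketch:
    # model = TrackTwoWrapper().eval()
    # mask = model(torch.rand(1, 3, 1024, 1024), torch.zeros(2, 1, 77, dtype=torch.int32))
    # mask.shape -> torch.Size([1024, 1024])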
4.3 Metrics
The evaluation is conducted in two stages:
Score = mIoU

    def compute_IoU(pred_seg, gd_seg):
        I = (pred_seg & gd_seg)
        U = (pred_seg | gd_seg)
        return I.sum() / (U.sum() + 1e-6)

    IoUs = [compute_IoU(p, g) for p, g in zip(preds, gts)]
    mIoU = sum(IoUs) / len(IoUs) * 100

    # NOTE: QNN/AIHub does not support torch.bool outputs, only torch.float (as in the provided sample solution).
    # We therefore apply `pred_seg = pred_mask > 0` to convert the model's binary float output to bool.
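For local validation on the sample data, the sketch below shows one way to binarize a float mask and score it with the compute_IoU above; the dummy arrays are placeholders for a real prediction and its ground-truth annotation.

    import numpy as np

    # Placeholders: in practice, pred_mask is the 1024x1024 float output of your model
    # (or of an AIHub inference job) and gt_mask is the matching sample annotation.
    pred_mask = (np.random.rand(1024, 1024) > 0.5).astype(np.float32)
    gt_mask = np.zeros((1024, 1024), dtype=np.uint8)
    gt_mask[256:768, 256:768] = 1

    pred_seg = pred_mask > 0             # float -> bool, matching the note above
    gd_seg = gt_mask.astype(bool)
    iou = compute_IoU(pred_seg, gd_seg)  # compute_IoU as defined above
    print(f"IoU: {100 * iou:.2f}")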
4.4 Sample Solution
To better support potential participants, we provide a detailed baseline model as a starting point for the Track-2 task. This includes code for data processing, training and testing, and compiling and inference on AIHub, along with pretrained weights. The latency (inference time) of this baseline on the test data will serve as the reference for determining whether submitted solutions are valid.
    def forward(input_image, text_input):
        # ... forward pass ...
        # NOTE: we noticed that QNN/AIHub does not support torch.bool outputs,
        # so you may need to convert pred_mask to torch.float.
        pred_mask = (pred_mask >= 0).float()
        return pred_mask

    Input:
      - input_image: shape=1x3x1024x1024
      - text_input:  shape=2x1x77
    Output:
      - pred_mask: 1024x1024  # float, binary
4.5 Data Format in Evaluation: All test data will be used to evaluate the submitted solutions online via AIHub. We therefore prepare the test data in a specific format that fits the requirements of the AIHub platform and QNN libraries. (*This mainly follows the data preprocessing of XDecoder for the grounding task on refCOCO.)
Input: each test sample consists of an Image and a Text description, prepared as follows.

Data Format
  Image: image_input: torch tensor, float, shape=1x3x1024x1024
  Text:  text_input: torch tensor, int, shape=2x1x77  # CLIP tokenizer output, text_input = [input_ids; attention_mask]

Explanation
  Image: Following the preprocessing of XDecoder, we (1) resize the longest edge of the original image to 1024, then (2) pad it to a 1024x1024 square with 0s. (3) No normalization is applied, so please remember to add normalization and any other pre-processing inside your model.
  Text: Following the XDecoder preprocessing, and to reduce the influence of different tokenizers for a fair comparison, we fix the tokenizer to the `openai-clip` text tokenizer. Check the sample code below for more details.
    * Quoted from transformers.CLIPTokenizer:
      input_ids (torch.LongTensor of shape (batch_size, sequence_length)): indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
      attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Sample data preparation code

  Image:

    img = Image.open(image_path).convert("RGB")
    transform = transforms.Compose([
        transforms.Resize(1000, max_size=1024),
    ])
    image = transform(img)
    image = torch.from_numpy(np.asanyarray(image)).float().permute(2, 0, 1)
    images = [image]
    image_input = ImageList.from_tensors(images, size_divisibility=1024).tensor

  Text:

    # fixed text tokenizer
    from transformers import CLIPTokenizer
    pretrained_tokenizer = 'openai/clip-vit-base-patch32'
    tokenizer = CLIPTokenizer.from_pretrained(pretrained_tokenizer)
    tokenizer.add_special_tokens({'cls_token': tokenizer.eos_token})

    # example tokenized text input to the model
    tokens = tokenizer(text, padding='max_length', truncation=True, max_length=77, return_tensors='pt')
    text_input = torch.stack((tokens['input_ids'], tokens['attention_mask']))  # Shape: 2x1x77
Example data

  Input text: "dog."  (*note: all input texts end with a period "." following the provided sample solution)

  Tokenized output = {
      'input_ids': tensor([[49406, 2308, 1774, 15762, 269, 49407, 49407, ..., 49407]],
                          device='cuda:0', dtype=torch.int32),
      'attention_mask': tensor([[1, 1, 1, 1, 0, 0, ..., 0]],
                               device='cuda:0', dtype=torch.int32)
  }  # padded to length 77; the trailing 49407s / 0s are padding, abbreviated here with "..." for readability
5. Compile, Profile, Inference via AIHub
Please refer to the provided [Sample Solution] for details on model compilation, profiling, and inference via AIHub.
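As a quick orientation only (the sample solution remains the reference), below is a sketch of the typical profile-and-inference flow with the qai_hub client, assuming `compile_job` is the shared compile job from Section 3; the device name and the placeholder input arrays are assumptions.

    import numpy as np
    import qai_hub as hub

    device = hub.Device("Snapdragon 8 Elite QRD")   # assumed target device; use the one given by the organizers
    target_model = compile_job.get_target_model()   # compiled model produced by the compile job in Section 3

    # On-device latency measurement
    profile_job = hub.submit_profile_job(model=target_model, device=device)

    # On-device inference on one prepared sample (see Section 4.5 for the input format)
    inference_job = hub.submit_inference_job(
        model=target_model,
        device=device,
        inputs=dict(
            input_image=[np.random.rand(1, 3, 1024, 1024).astype(np.float32)],  # placeholder image
            text_input=[np.zeros((2, 1, 77), dtype=np.int32)],                  # placeholder tokens
        ),
    )
    output = inference_job.download_output_data()   # dict: output name -> list of numpy arrays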
Important! After the submission window closes, the TOP-10 teams on the leaderboard will be contacted to confirm which model is their final solution for evaluation on the whole test dataset. In addition to the QNN model shared via AIHub, the converted ONNX model and detailed evaluation scripts will also be requested.
6. References