1. Information
____________________________________________________________________________________
2. Prerequisites
Two Registrations:
____________________________________________________________________________________
3. Submission Format
Model Format: Participants may train their models with various libraries such as PyTorch, TensorFlow, ONNX, and AI Model Efficiency Toolkit (AIMET) quantized models. Qualcomm AI Hub supports models trained with these libraries and can compile them directly for mobile devices. Once your model is compiled with Qualcomm AI Hub, follow the two steps below to submit it (the compile job on AIHub) for evaluation and ranking.
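For orientation, here is a minimal compile sketch using the Qualcomm AI Hub Python client (qai_hub), assuming `model` is your own torch.nn.Module with the interface described in Section 4; the device name and the dtype syntax of `input_specs` are assumptions, so please verify them against the [Track 2 Sample Solution] and the AIHub documentation.

    import torch
    import qai_hub as hub

    # Assumption: `model` is your torch.nn.Module taking (input_image, text_input)
    # with the shapes described in Section 4.5.
    example_image = torch.rand(1, 3, 1024, 1024)
    example_text = torch.zeros(2, 1, 77, dtype=torch.int32)
    traced_model = torch.jit.trace(model, (example_image, example_text))

    # The device name below is an assumption; use the target device given in the sample solution.
    compile_job = hub.submit_compile_job(
        model=traced_model,
        device=hub.Device("Snapdragon 8 Elite QRD"),
        input_specs=dict(
            input_image=((1, 3, 1024, 1024), "float32"),
            text_input=((2, 1, 77), "int32"),
        ),
    )
    # After compiling, remember to share the compile job with the organizers (Step 1 below).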
There are two steps to complete the submission.
Step 1: On Qualcomm AI Hub, share access to the model you want to submit with lowpowervision@gmail.com. This ensures that our evaluation server can access submitted models from Qualcomm AI Hub.
Step 2: Fill out the submission form.
* Please refer to the [Track 2 Sample Solution] for more details on the submission.
Share compile job:

    # IMPORTANT! You must share your compile job with the LPCVC organizers so that we can pull and evaluate it.
    compile_job.modify_sharing(add_emails=['lowpowervision@gmail.com'])
Please note: your models will not be evaluated or ranked unless you complete both steps (Step 1 and Step 2). Each model requires its own submission form, because the Compile Job ID must be specified in the form.
____________________________________________________________________________________
4. Evaluation Details
4.1 Data
Training data: We do not restrict the training data for this competition; participants are free to use any accessible datasets.
Test data: The test dataset comprises approximately 1,000 RGB images from around 200 classes, randomly selected from various public sources (all data are newly annotated by the LPCVC 2025 team; no existing public annotations are involved). Each image is accompanied by several text descriptions indicating different targets.
Sample data: We provide ~50 sample images with annotations for participants to explore and validate their solutions as a starting point. Please note that the sample data are not included in the final test-stage dataset. The entire sample dataset is accessible on Google Drive [https://drive.google.com/file/d/1OWw5jiaIkbnfe8FbgLTynjKiCj00TLJh/view?usp=sharing].
4.2 Task
Model Input: An RGB image along with a text description.
Model Output: A binary mask prediction that matches the input image size (1024x1024).
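To make this I/O contract concrete, below is a minimal, hypothetical wrapper sketch; the 1x1 convolution is only a placeholder for a participant's actual image-plus-text segmentation network, and all names are illustrative.

    import torch
    import torch.nn as nn

    class TrackTwoWrapper(nn.Module):
        # Hypothetical sketch matching the Track 2 I/O contract; the conv head
        # stands in for a real referring-segmentation network.
        def __init__(self):
            super().__init__()
            self.head = nn.Conv2d(3, 1, kernel_size=1)  # placeholder network

        def forward(self, input_image: torch.Tensor, text_input: torch.Tensor) -> torch.Tensor:
            # input_image: 1x3x1024x1024 float RGB (unnormalized; see Section 4.5)
            # text_input:  2x1x77 int, stacked [input_ids; attention_mask]
            logits = self.head(input_image)  # 1x1x1024x1024 (text ignored by this stub)
            pred_mask = (logits >= 0).float().reshape(1024, 1024)  # binary values, float dtype (QNN-friendly)
            return pred_mask

    # Usage sketch:
    # model = TrackTwoWrapper().eval()
    # mask = model(torch.rand(1, 3, 1024, 1024), torch.zeros(2, 1, 77, dtype=torch.int32))
    # mask.shape -> torch.Size([1024, 1024])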
4.3 Metrics
The evaluation is conducted in two stages:
Score = mIoU

    def compute_IoU(pred_seg, gd_seg):
        I = (pred_seg & gd_seg)
        U = (pred_seg | gd_seg)
        return I.sum() / (U.sum() + 1e-6)

    IoUs = [compute_IoU(p, g) for p, g in zip(preds, gts)]
    mIoU = sum(IoUs) / len(IoUs) * 100

    # NOTE: QNN/AIHub does not support torch.bool outputs, only torch.float (as in the provided sample solution).
    # We therefore apply `pred_seg = pred_mask > 0` to convert the model's binary float output to bool.
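For local validation on the sample data, the sketch below shows one way to binarize a float mask and score it with the compute_IoU above; the dummy arrays are placeholders for a real prediction and its ground-truth annotation.

    import numpy as np

    # Placeholders: in practice, pred_mask is the 1024x1024 float output of your model
    # (or of an AIHub inference job) and gt_mask is the matching sample annotation.
    pred_mask = (np.random.rand(1024, 1024) > 0.5).astype(np.float32)
    gt_mask = np.zeros((1024, 1024), dtype=np.uint8)
    gt_mask[256:768, 256:768] = 1

    pred_seg = pred_mask > 0             # float -> bool, matching the note above
    gd_seg = gt_mask.astype(bool)
    iou = compute_IoU(pred_seg, gd_seg)  # compute_IoU as defined above
    print(f"IoU: {100 * iou:.2f}")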
4.4 Sample Solution
To better support potential participants, we provide a detailed baseline model as a starting point for the Track-2 task. This includes code for data processing, training and testing, and compiling and inference on AIHub, along with pretrained weights. The latency (inference time) of this baseline on the test data will serve as the reference for determining whether submitted solutions are valid.
    def forward(input_image, text_input):
        # ... forward pass ...
        # NOTE: we noticed that QNN/AIHub does not support torch.bool outputs,
        # so you may need to convert pred_mask to torch.float.
        pred_mask = (pred_mask >= 0).float()
        return pred_mask

    Input:
      - input_image: shape=1x3x1024x1024
      - text_input:  shape=2x1x77
    Output:
      - pred_mask: 1024x1024  # float, binary
4.5 Data Format in Evaluation: All test data will be used to evaluate the submitted solutions online via AIHub. We therefore prepare the test data in a specific format that fits the requirements of the AIHub platform and QNN libraries. (*This mainly follows the data preprocessing of XDecoder for the grounding task on refCOCO.)
Input: each test sample consists of an Image and a Text description, prepared as follows.

Data Format
  Image: image_input: torch tensor, float, shape=1x3x1024x1024
  Text:  text_input: torch tensor, int, shape=2x1x77  # CLIP tokenizer output, text_input = [input_ids; attention_mask]

Explanation
  Image: Following the preprocessing of XDecoder, we (1) resize the longest edge of the original image to 1024, then (2) pad it to a 1024x1024 square with 0s. (3) No normalization is applied, so please remember to add normalization and any other pre-processing inside your model.
  Text: Following the XDecoder preprocessing, and to reduce the influence of different tokenizers for a fair comparison, we fix the tokenizer to the `openai-clip` text tokenizer. Check the sample code below for more details.
    * Quoted from transformers.CLIPTokenizer:
      input_ids (torch.LongTensor of shape (batch_size, sequence_length)): indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
      attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Sample data preparation code

  Image:

    img = Image.open(image_path).convert("RGB")
    transform = transforms.Compose([
        transforms.Resize(1000, max_size=1024),
    ])
    image = transform(img)
    image = torch.from_numpy(np.asanyarray(image)).float().permute(2, 0, 1)
    images = [image]
    image_input = ImageList.from_tensors(images, size_divisibility=1024).tensor

  Text:

    # fixed text tokenizer
    from transformers import CLIPTokenizer
    pretrained_tokenizer = 'openai/clip-vit-base-patch32'
    tokenizer = CLIPTokenizer.from_pretrained(pretrained_tokenizer)
    tokenizer.add_special_tokens({'cls_token': tokenizer.eos_token})

    # example tokenized text input to the model
    tokens = tokenizer(text, padding='max_length', truncation=True, max_length=77, return_tensors='pt')
    text_input = torch.stack((tokens['input_ids'], tokens['attention_mask']))  # Shape: 2x1x77
Example data

  Input text: "dog."  (*note: all input texts end with a period "." following the provided sample solution)

  Tokenized output = {
      'input_ids': tensor([[49406, 2308, 1774, 15762, 269, 49407, 49407, ..., 49407]],
                          device='cuda:0', dtype=torch.int32),
      'attention_mask': tensor([[1, 1, 1, 1, 0, 0, ..., 0]],
                               device='cuda:0', dtype=torch.int32)
  }  # padded to length 77; the trailing 49407s / 0s are padding, abbreviated here with "..." for readability
5. Compile, Profile, Inference via AIHub
Please refer to the provided [Sample Solution] for details on model compilation, profiling, and inference via AIHub.
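As a quick orientation only (the sample solution remains the reference), below is a sketch of the typical profile-and-inference flow with the qai_hub client, assuming `compile_job` is the shared compile job from Section 3; the device name and the placeholder input arrays are assumptions.

    import numpy as np
    import qai_hub as hub

    device = hub.Device("Snapdragon 8 Elite QRD")   # assumed target device; use the one given by the organizers
    target_model = compile_job.get_target_model()   # compiled model produced by the compile job in Section 3

    # On-device latency measurement
    profile_job = hub.submit_profile_job(model=target_model, device=device)

    # On-device inference on one prepared sample (see Section 4.5 for the input format)
    inference_job = hub.submit_inference_job(
        model=target_model,
        device=device,
        inputs=dict(
            input_image=[np.random.rand(1, 3, 1024, 1024).astype(np.float32)],  # placeholder image
            text_input=[np.zeros((2, 1, 77), dtype=np.int32)],                  # placeholder tokens
        ),
    )
    output = inference_job.download_output_data()   # dict: output name -> list of numpy arrays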
Important! After the submission window closes, the TOP-10 teams on the leaderboard will be contacted to confirm which model is their final solution for evaluation on the whole test dataset. In addition to the QNN model shared via AIHub, the converted ONNX model and detailed evaluation scripts will also be requested.
6. References