1. Information
____________________________________________________________________________________
2. Prerequisites
Two Registrations:
____________________________________________________________________________________
3. Submission Format
Model Format: Participants can train their models with various libraries, such as PyTorch, TensorFlow, or the AI Model Efficiency Toolkit (AIMET) for quantized models, or export them to ONNX. Qualcomm AI Hub supports models from these libraries and can compile them directly for mobile devices. Once the model is compiled on Qualcomm AI Hub (as a compile_job), please follow the two steps below to submit it for evaluation and ranking.
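For reference, the sketch below shows one way to submit a compile job through the Qualcomm AI Hub Python client (qai-hub). The dummy model, device name, and input shapes are placeholders for illustration only, not the official Track 1 settings; please follow the AI Hub documentation and the sample solution for the exact configuration.

# Minimal compile-job sketch using the Qualcomm AI Hub Python client (qai-hub).
# The dummy model, device name, and input shapes are placeholders, not official Track 1 settings.
import torch
import qai_hub as hub

class DummyRetrievalModel(torch.nn.Module):
    """Placeholder two-tower model; replace with your actual solution."""
    def forward(self, image_input, text_input):
        image_emb = image_input.mean(dim=(2, 3))                     # stand-in image embedding
        text_emb = text_input[0].float().mean(dim=1, keepdim=True)   # stand-in text embedding
        return image_emb, text_emb

model = DummyRetrievalModel().eval()
example_image = torch.rand(1, 3, 1024, 1024)
example_text = torch.zeros(2, 1, 77, dtype=torch.int32)

# Trace the model so AI Hub can compile it for an on-device target.
traced_model = torch.jit.trace(model, (example_image, example_text))

compile_job = hub.submit_compile_job(
    model=traced_model,
    device=hub.Device("Samsung Galaxy S24 (Family)"),   # placeholder device
    input_specs=dict(
        image_input=((1, 3, 1024, 1024), "float32"),
        text_input=((2, 1, 77), "int32"),
    ),
)
print(compile_job)  # note the compile job ID for the submission form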
There are two steps to complete the submission.
Step 1: On Qualcomm AI Hub, share access to the model you want to submit with lowpowervision@gmail.com. This ensures that our evaluation server can pull submitted models from Qualcomm AI Hub.
Step 2: Fill out the submission form.
* Please refer to [Track 1 Sample Solution] for more details on the submission process. (TO BE UPDATED)
Share compile job:
# IMPORTANT! You must share your compile job with the LPCVC organizers so that we can pull and evaluate it.
compile_job.modify_sharing(add_emails=['lowpowervision@gmail.com'])
Please note: Your models will not be evaluated or ranked unless you complete both steps (Step 1 and Step 2). Each model requires its own submission form, because the Compile Job ID must be specified in the form (see the snippet below).
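For convenience, the snippet below shows one way to read the Compile Job ID from the Python client. The get_job call and job_id attribute are our assumptions based on the qai-hub client; the ID is also visible in the job URL on the AI Hub dashboard.

# Looking up the Compile Job ID needed for the submission form (assumed qai-hub client calls).
import qai_hub as hub

compile_job = hub.get_job("jabc123xy")   # hypothetical job ID placeholder; use your own job
print(compile_job.job_id)                # paste this ID into the submission form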
____________________________________________________________________________________
4. Evaluation Details
4.1 Data
Training data: We do not restrict the training data for this competition; participants are free to use any accessible datasets.
Test data: The test dataset comprises approximately 300 RGB images of arbitrary objects. The images are randomly selected from various public sources, and all annotations are newly created by the LPCVC team (no existing public annotations are reused). Each image is accompanied by several text descriptions referring to specific targets or to the whole image.
Sample data: We offer ~50 sample images with textual annotations as a starting point for participants to explore and validate their solutions. Please note that the sample data are not included in the final test-stage dataset. The entire sample dataset will be accessible on Google Drive [GoogleDrive (TO BE UPDATED)].
4.2 Task
Goal: Retrieve the most relevant text descriptions for a given image from a candidate pool.
Model Input: An RGB image and all textual descriptions in the dataset.
Model Output: Image and text embeddings (used to compute similarity scores and then rank the candidates; see the sketch after this list).
* Check the provided [Sample Solution (TO BE UPDATED)] for the detailed input and output data formats used by the evaluation pipeline.
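To make the retrieval step concrete, here is a minimal sketch of turning image and text embeddings into a ranking via cosine similarity. The embedding dimension and candidate count are made up for illustration; the official scoring is defined by the evaluation pipeline in the sample solution.

# Illustrative ranking from embeddings via cosine similarity (not the official scoring script).
import torch

image_emb = torch.randn(1, 512)       # hypothetical image embedding produced by the model
text_embs = torch.randn(300, 512)     # hypothetical embeddings of all candidate texts

# L2-normalize so the dot product equals cosine similarity.
image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
text_embs = torch.nn.functional.normalize(text_embs, dim=-1)

similarity = image_emb @ text_embs.T  # shape: (1, 300)
topk = similarity.topk(k=5, dim=-1)   # indices of the 5 most relevant candidate texts
print(topk.indices)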
4.3 Metrics
The evaluation is conducted in two stages:
4.4 Sample Solution
We provide [OpenAI-CLIP] as a sample solution to better support participants. Its latency (inference time) on the test data will be used as the reference for determining whether submitted solutions are valid.
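For orientation, the sketch below obtains image and text embeddings from the openai/clip-vit-base-patch32 checkpoint with the Hugging Face transformers API. It is not the official sample solution; the released sample will define the exact preprocessing, tokenization, and export code.

# Sketch of producing CLIP image/text embeddings with Hugging Face transformers
# (for orientation only; the official sample solution may differ).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")      # hypothetical local image
texts = ["a dog.", "a red car.", "a person riding a bike."]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )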
4.5 Data Format in Evaluation: All test data will be used to evaluate the submitted solutions online via AIHub. We therefore prepare the test data in a specific format that fits the requirements of the AIHub platform and the QNN libraries. (*mainly follows the OpenAI-CLIP setting) (TO BE UPDATED)
Each test sample provides two model inputs, Image and Text, prepared as follows.

Data Format
- Image: image_input: torch tensor float, shape=1x3x1024x1024 # resized and zero-padded RGB image (see Explanation)
- Text: text_input: torch tensor int, shape=2x1x77 # CLIP tokenizer output, text_input = [input_ids; attention_mask]

Explanation
- Image: Following the preprocessing operations of XDecoder, we (1) first resize the longest edge of the original image to 1024, (2) then pad it to a 1024x1024 square with zeros, and (3) apply no normalization, so please remember to add normalization and any other pre-processing inside your model.
- Text: Following the XDecoder preprocessing, and to reduce the influence of using different tokenizers for a fair comparison, we fix the tokenizer to the `openai-clip` text tokenizer. Check the sample code below for more details.
  * Quoted from transformers.CLIPTokenizer: input_ids (torch.LongTensor of shape (batch_size, sequence_length)): indices of input sequence tokens in the vocabulary; padding will be ignored by default should you provide it. attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Sample data preparation code
Image:
# Image preprocessing (XDecoder-style): resize so the longest edge is at most 1024, then zero-pad to 1024x1024.
# ImageList is assumed here to be the detectron2-style utility used by XDecoder.
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
from detectron2.structures import ImageList

img = Image.open(image_path).convert("RGB")   # image_path: path to an input image
transform = transforms.Compose([
    transforms.Resize(1000, max_size=1024),
])
image = transform(img)
image = torch.from_numpy(np.asanyarray(image)).float().permute(2, 0, 1)
images = [image]
image_input = ImageList.from_tensors(images, size_divisibility=1024).tensor   # shape: 1x3x1024x1024

Text:
# Fixed text tokenizer (openai-clip)
import torch
from transformers import CLIPTokenizer

pretrained_tokenizer = 'openai/clip-vit-base-patch32'
tokenizer = CLIPTokenizer.from_pretrained(pretrained_tokenizer)
tokenizer.add_special_tokens({'cls_token': tokenizer.eos_token})

# Example tokenized text input to the model; text is the input description string, e.g. "dog."
tokens = tokenizer(text, padding='max_length', truncation=True, max_length=77, return_tensors='pt')
text_input = torch.stack((tokens['input_ids'], tokens['attention_mask']))   # shape: 2x1x77
Example data
Input text: "dog." (*note: all input texts end with a period "." following the provided sample solution)
Tokenized output = {
  'input_ids': tensor([[49406,  2308,  1774, 15762,   269, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
                        49407, 49407, 49407, 49407, 49407]], device='cuda:0', dtype=torch.int32),
  'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
}
____________________________________________________________________________________
5. Compile, Profile, Inference via AIHub
Please refer to the provided [Sample Solution] for details on model compilation, profiling, and inference via AIHub.
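As a rough orientation, the sketch below chains profile and inference jobs after a compile job with the qai-hub Python client. The job ID, device name, and input names are placeholders; the sample solution is the authoritative reference for the exact workflow.

# Sketch of profiling and running on-device inference on the compiled model via qai-hub.
# Job ID, device name, and input names are placeholders; follow the sample solution for official settings.
import numpy as np
import qai_hub as hub

compile_job = hub.get_job("jabc123xy")               # hypothetical ID of your shared compile job
device = hub.Device("Samsung Galaxy S24 (Family)")   # placeholder device
target_model = compile_job.get_target_model()        # compiled model produced by the compile job

# Measure on-device latency.
profile_job = hub.submit_profile_job(model=target_model, device=device)

# Run on-device inference with one preprocessed sample.
inference_job = hub.submit_inference_job(
    model=target_model,
    device=device,
    inputs=dict(
        image_input=[np.random.rand(1, 3, 1024, 1024).astype(np.float32)],
        text_input=[np.zeros((2, 1, 77), dtype=np.int32)],
    ),
)
outputs = inference_job.download_output_data()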
Important! After the submission window closes, the TOP-5 teams on the leaderboard will be contacted to confirm which model is their final solution for evaluation on the whole test data. In addition to the QNN model shared via AIHub, the converted ONNX model and detailed evaluation scripts will also be requested (see the export sketch below).
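If your final solution is a PyTorch model, one common way to produce the requested ONNX file is torch.onnx.export, sketched below with an assumed placeholder model, input names, shapes, and filename; adapt it to your actual model signature.

# Sketch of exporting a PyTorch model to ONNX for the final hand-off
# (model, input names, shapes, and filename are assumptions; adapt to your solution).
import torch

class DummyModel(torch.nn.Module):
    """Placeholder; replace with your trained retrieval model."""
    def forward(self, image_input, text_input):
        return image_input.mean(dim=(2, 3)), text_input[0].float().mean(dim=1, keepdim=True)

model = DummyModel().eval()
example_image = torch.rand(1, 3, 1024, 1024)
example_text = torch.zeros(2, 1, 77, dtype=torch.int32)

torch.onnx.export(
    model,
    (example_image, example_text),
    "final_solution.onnx",                               # hypothetical output filename
    input_names=["image_input", "text_input"],
    output_names=["image_embedding", "text_embedding"],
    opset_version=17,
)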
____________________________________________________________________________________
6. References