LPCV - Low Power Computer Vision

IEEE LOW-POWER COMPUTER VISION CHALLENGE
2020 CVPR WORKSHOP

Online Track - UAV Video

Basic Information:

Sponsor: Facebook
Competition: online submission through lpcv.ai during 2020/07/01-2020/07/31 (updated)
Restriction: Each team can submit at most twice every 24 hours
Vision Task: Optical character recognition (OCR) in video captured by an unmanned aerial vehicle (UAV). More details below.
Hardware: Raspberry Pi 3B+
Software: Standard system image + PyTorch, built from master. We provide multiple sample solutions at Sample Solution GitHub Repo. The programming language is not specified, but the submission has to be executable using the __main__.py and commands specified in the sample solution.
Reimbursement policy: The first 50 teams that submit solutions will receive $100 per team.
Technical Support: Please join the online discussion forum lpcvc.slack.com.
Winners will be announced at the Low-Power Computer Vision Workshop at 2020 CVPR.
Input: Two files: (1) one file of video (MP4) and (2) a text file with a list of questions. The two files are given at the same time (at the beginning). The length of the video is approximately 2-5 minutes.
Output: Answers to the questions.
Evaluation: The sample solution at Sample Solution GitHub Repo sets the minimum score for winning teams: a team is disqualified if its solution's correctness is below that of the sample solution. Score = correctness / energy. Correctness = average of the scores over all question keys. The score of each answer is the length of the correct answer minus the distance between the submitted answer and the correct answer. Please refer to our GitHub Wiki for detailed information.

Vision Task:

Please see sample data (video) at Data page.

Discover and recognize words, and associate them with video frames. Examples of data (snapshots).

Question	Answer
056	B ELEVATOR CONTROL RESTRICTED AREA ELEVATOR PERSONNEL ONLY
ELEVATOR	B056 CONTROL RESTRICTED AREA PERSONNEL ONLY
MULTIPURPOSE	137 TRAINED PERSONNEL ONLY
137	TRAINED PERSONNEL ONLY
132	MEN
MEN	132
122	CONFERENCE
WOMEN	- (not in the data)
332	- (not in the data)

Frequently Asked Questions (FAQ):

Section A: Inputs & Outputs

Q: Are only English letters and numbers considered?
A: Yes.
Q: Are all English words spelled correctly?
A: Not necessarily. It is possible to have words that do not appear in typical dictionaries. Spell checking is not necessary.
Q: Are only uppercase letters used?
A: Lower case letters are possible. It is also possible to have mixtures of upper and lower case letters.
Q: Are the answers case sensitive?
A: No. All answers will be converted to upper case (by the referee system) before checking correctness.
Q: Will the case of text in the input files be consistent with the case of the text in the videos?
A: Yes. For example, if "EXIT;" is a question in the input text, the poster with text "EXIT", not "exit" or "eXit", will show up in the video.
Q: What are the colors of the words or numbers?
A: There is no particular restriction about colors. In the example above, the words are written in black and red. In the real data, the colors may be different.
Q: Will the answers include punctuation marks?
A: No. If a punctuation mark is detected in the video, it should be removed from the answer. For example: "you're" -> "youre"; "10,000" -> "10000".
Q: Is it possible that the answer to a question has not appeared in the video (e.g. room number doesn’t exist)?
A: It is possible. If this occurs, the correct answer is -.
Q: If an answer has multiple words, how should they be separated?
A: By space. The referee will reduce multiple spaces to only one.
Q: If there are multiple words, how should they be ordered?
A: All reported words will be sorted alphabetically by the referee system before checking correctness. It is not necessary to sort the words before sending them to the referee system.
Q: What is the format of the input text file?
A: A collection of words or numbers separated by semicolons, for example

MULTIPURPOSE; 137; 132; MEN
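
As an illustration only, the input format above could be read with a few lines of Python. The function name parse_questions is hypothetical, and treating line breaks in the input as plain whitespace is an assumption, not something the referee system guarantees.

```python
def parse_questions(text):
    """Split a semicolon-separated question list into clean question strings."""
    # Assumption: line breaks carry no meaning, so fold all whitespace runs.
    flat = " ".join(text.split())
    # Drop empty pieces, such as the one after a trailing semicolon.
    return [q.strip() for q in flat.split(";") if q.strip()]


questions = parse_questions("MULTIPURPOSE; 137; 132; MEN")
print(questions)  # ['MULTIPURPOSE', '137', '132', 'MEN']
```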

Q: What is the format of the output text file?
A: Question: Answer separated by semicolons. If one question has multiple answers, separate the answers by space, for example
MULTIPURPOSE: 137 TRAINED PERSONNEL ONLY; 137: TRAINED PERSONNEL ONLY MULTIPURPOSE; 132: MEN; MEN: 132;
*Note: Multiple same words from one question should all be stored into the output answer. For example:
123: at cat at dog dog cat; 234: all bear ring all bear;
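
A minimal sketch of writing this output format in Python is shown here. The function name format_answers and the dictionary layout are hypothetical; the "-" used for a question without an answer follows the FAQ entry about answers that do not appear in the video.

```python
def format_answers(answers):
    """Render {question: [words]} as 'Q1: w1 w2; Q2: w3;'."""
    parts = []
    for question, words in answers.items():
        # Repeated words are kept as they are; an empty list becomes "-".
        body = " ".join(words) if words else "-"
        parts.append(f"{question}: {body};")
    return " ".join(parts)


example = {
    "123": ["at", "cat", "at", "dog", "dog", "cat"],
    "234": ["all", "bear", "ring", "all", "bear"],
}
print(format_answers(example))
# 123: at cat at dog dog cat; 234: all bear ring all bear;
```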
Q: Must the answers be ordered by the input questions?
A: No. The outputs should be question: answer pairs. Since the questions are included, the output does not need to follow the input order.
Q: Is it possible that the same question appears twice?
A: No.
Q: Are line breaks needed?
A: No. Line breaks are ignored.
Q: Is it possible that one question may appear in different frames? What should the answer be?
A: Yes. For example, one video may capture the word MEN multiple times if the building has several restrooms. The answer should include all words in the same frames when the question occurs. If the building has several restrooms, the answer should include room numbers and the words in the same frame.
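
One possible way to assemble such answers is sketched below. It assumes a hypothetical input, sightings, that already holds one list of recognized words per distinct sighting of a sign (detecting and de-duplicating sightings across raw frames is the hard part and is not shown). Leaving the question word itself out of the answer is also an assumption, based on the examples on this page.

```python
def collect_answers(questions, sightings):
    """Map each question to every other word seen alongside it."""
    answers = {q: [] for q in questions}
    for words in sightings:
        for q in questions:
            if q in words:
                # Keep all co-occurring words, minus one copy of the question.
                rest = list(words)
                rest.remove(q)
                answers[q].extend(rest)
    return answers


# Hypothetical OCR results: the word MEN is seen next to two room numbers.
sightings = [["132", "MEN"], ["MEN", "245"], ["122", "CONFERENCE"]]
print(collect_answers(["MEN", "122", "WOMEN"], sightings))
# {'MEN': ['132', '245'], '122': ['CONFERENCE'], 'WOMEN': []}
```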
Q: Is it possible to have many similar words in the same frame? If this is possible, sorting the words may cause multiple errors if one word is wrong and put in the wrong place.
A: Similar words are possible but not intentional. The purpose of this competition is to improve the technology of computer vision, not to play games of words. Thus, this should not be a problem.
Q: Is there an upper limit on how many words (or characters) may appear in the same frame? How large are the fonts?
A: There is no pre-set limit. The data should be “reasonable” for human viewers to answer the questions.
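
The normalization rules in this section (upper case, no punctuation, single spaces, alphabetical order) can be approximated locally, for example to compare your own answers against hand-made ground truth. This is a sketch under those stated rules; normalize_answer is a hypothetical name and the actual referee code may differ in detail.

```python
import re


def normalize_answer(answer):
    """Approximate the normalization described in Section A."""
    if answer.strip() == "-":
        # "-" marks a question whose answer is not in the video.
        return "-"
    text = answer.upper()
    # Remove punctuation: keep only letters, digits and whitespace.
    text = re.sub(r"[^A-Z0-9\s]", "", text)
    # split() collapses repeated spaces; sorting makes word order irrelevant.
    return " ".join(sorted(text.split()))


print(normalize_answer("you're   personnel only, 10,000"))
# 10000 ONLY PERSONNEL YOURE
```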

Section B: Testing Videos

Q: Are the question - answer pairs always found on signs posted in the building?
A: No. The UAV may stop in front of any words (such as posters).
Q: Will the question-answer pairs consider postings passed by the UAV without stopping? How long will the UAV stop in front of the words?
A: The questions should be “reasonable” for human viewers. It is possible that the UAV may fly very slowly without stopping in front of the words.
Q: Do the words have the same font?
A: Not necessarily.
Q: Are the words (or numbers) always captured by the UAV at 90 degrees (the camera looking straight at the words, giving a “head-on” view)?
A: No. It is possible that the words are captured at angles different from 90 degrees. The data should be “reasonable” when judged by human viewers (who may need to slow down the video). A figure is shown below:
Q: Will the words always be horizontal?
A: Not necessarily. The data should be “reasonable” when judged by human viewers.
Q: Are all video files captured by the same UAV?
A: The sample data may be captured by different UAVs (with different resolutions). In the competition, all teams will be evaluated using the same test video files captured by the same UAV.
Q: What is the resolution (number of pixels per frame) of the video?
A: Different UAVs may be used and their cameras may have different resolutions. Your solution has to determine the resolution of the video it is given.
Q: Is it possible that data from two different UAVs are concatenated into one video clip?
A: No. One video clip is captured continuously.

Section C: Evaluation

Q: How do you evaluate the distances of words?
A: We use the Levenshtein distance. A reference implementation is https://pypi.org/project/python-Levenshtein/0.12.0/
Q: If a word (or a number) is partially correct, how many points will the answer get?
A: Please see the answer to the previous question.
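
To make partial credit concrete, the sketch below combines a plain dynamic-programming Levenshtein distance with the per-answer rule from the Evaluation entry under Basic Information (length of the correct answer minus the distance). It is a stand-in for the referenced python-Levenshtein package, not the referee code; whether the referee compares whole answer strings or individual words, and whether negative scores are clamped, is not stated here, so this version simply compares two strings.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def answer_score(correct, submitted):
    """Length of the correct answer minus its distance to the submission."""
    return len(correct) - levenshtein(correct, submitted)


print(answer_score("CONFERENCE", "CONFERENCE"))  # 10
print(answer_score("CONFERENCE", "CONFERENGE"))  # 9
```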
Q: Can the data be sent to a cloud?
A: No. The embedded computer test environment is disconnected from the Internet.
Q: Can I log into the embedded computer and install my software?
A: No. The embedded computer is disconnected from the Internet.
Q: What happens after running one solution?
A: The embedded computer test environment is reset so that all solutions start from the same initial state.
Q: Will the computation be performed while a UAV is flying?
A: No. The data has been captured in advance and will be given as a file. This is necessary to ensure fairness: All solutions are evaluated using the same data. Also, this allows the referee system to run 24 hours to accommodate many contestants.
Q: Do I have to use the Raspberry Pi 3B+?
A: Yes. This is an online competition and it is necessary to have the same hardware so that the solutions can be compared fairly.
Q: What software framework will be used?
A: PyTorch. This is an online competition and it is necessary to have the same software framework.
Q: How is power consumption measured?
A: With a Yokogawa WT 300 power meter. Idle power will be subtracted.
Q: How much time should my program run? Is there any restriction?
A: Suppose the length of a video is n seconds. Your program must finish within 5n seconds, that is, within 500% (updated) of the video time. You can process the video as fast as you wish. If a video is 3 minutes (180 seconds), your program must finish within 900 seconds.
Q: What happens if my program takes less time?
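
One way a solution might respect the 5n rule is to track its own elapsed time and stop processing frames before the budget runs out. The sketch below is only an illustration: process_with_deadline, handle_frame and the 10% safety margin are hypothetical choices, and the official timing is done by the referee system, not by the program itself.

```python
import time


def process_with_deadline(frames, handle_frame, video_seconds,
                          factor=5.0, margin=0.9):
    """Process frames until a fraction `margin` of the time budget is used."""
    budget = factor * video_seconds * margin  # seconds this loop may spend
    start = time.monotonic()
    done = 0
    for frame in frames:
        if time.monotonic() - start >= budget:
            break  # stop early and leave time to write the answers
        handle_frame(frame)
        done += 1
    return done


# Toy usage: ten dummy frames and a handler that does nothing.
print(process_with_deadline(range(10), lambda frame: None, video_seconds=180))
```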
A: When your program finishes, your program should issue a command to stop power measurement.
Q: What happens if my program takes longer than the allowed time?
A: If your program takes too long, your solution is disqualified.
Q: How will the limit of execution time be measured?
A: The power consumption and execution time will be measured by the referee system. It will not be measured by a human pressing a stopwatch.
Q: If you want to give freedom, why do you limit the maximum execution time to 500% (updated) of the video time?
A: The referee system has to evaluate many solutions. Thus, the referee system must not allow any single solution to take too long.
Q: Will you provide training data?
A: No. There are many datasets for recognizing numbers and characters. To preserve flexibility, we do not provide training data in any format (e.g. annotations, bounding boxes). Sample test data will be released at Data page.
Q: Can I inspect the referee system?
A: Yes. The referee system will be open-source. Please visit lpcvc.slack.com.
Q: What is the flow of information of the referee system?

Q: Can a solution reduce the number of pixels before recognizing the words?
A: Yes.
Q: Can a solution skip frames before recognizing the words?
A: Yes.
Q: Will the test data for selecting winners be hidden?
A: Yes. The test data will be different from the sample data and will not be publicly available.
Q: Will you provide a sample solution?
A: Yes. An open-source solution will be provided before the end of April.
Q: How will the winners be determined?
A: The score is accuracy divided by the energy consumption. Energy is the accumulation of power over time. Thus, your solution should consider the relationship between power and time. The sample solution will be the minimum requirement for winning. If a solution has accuracy lower than that of the sample solution, the solution is disqualified.
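
The relationship between power, time and score can be sketched as follows. The sample readings, the uniform sampling interval and the function names are hypothetical; idle power is subtracted as described in the power measurement answer, and the units the referee actually reports are not specified here.

```python
def energy(power_watts, dt_seconds, idle_watts=0.0):
    """Accumulate (power - idle) over uniformly spaced power samples."""
    return sum(p - idle_watts for p in power_watts) * dt_seconds


def final_score(accuracy, energy_used):
    """Score = accuracy divided by energy consumption."""
    return accuracy / energy_used


# Hypothetical readings in watts, one per second, with 2.0 W idle power.
samples = [3.0, 3.5, 4.0, 3.5]
used = energy(samples, dt_seconds=1.0, idle_watts=2.0)
print(used)                     # 6.0
print(final_score(12.0, used))  # 2.0
```

Because energy grows with both power and running time, a solution that draws more power but finishes much sooner can still score higher than a slower, lower-power one.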
Q: Who can participate?
A: Anyone in the world can participate, except persons covered by the US government's restrictions on Embargoed and Sanctioned Countries. The restrictions are needed because the organizers reside in the US.
Q: Is the competition restricted to students?
A: The competition is open to anyone (again, with the restrictions by the relevant laws).
Q: Are there any rules about conflict of interest?
A: The members of the organizing committee are prohibited from joining any team that enters the competition. The sponsoring organizations are allowed to participate and be ranked but are not allowed to receive cash prizes.
Q: Who are the organizers and sponsors?
A: The information is posted at lpcv.ai.

Section D: Other Questions

Q: Why do you use UAVs?
A: UAVs have many applications, such as inspection of bridges, buildings, and land use.
Q: Why do you choose recognizing words as the vision problem?
A: This problem is easy for most people to understand. Humans should be able to answer the questions easily (they may need to pause the video from time to time). Some other tasks (such as recognizing cracks in bridges) require more expertise in the specific domains (such as bridge construction). Also, capturing such data requires finding bridges with cracks. This is the first time the low-power competition uses data captured from a UAV. Thus, the organizers chose a simpler vision problem.
Q: How do you protect privacy?
A: No human face appears in the data. The data is captured in the public areas on the campus of Purdue University, West Lafayette, Indiana, USA.
Q: Why do you capture the data inside buildings? Didn’t you say that you wanted to use UAVs to inspect bridges and buildings?
A: Three reasons. First, the FAA (Federal Aviation Administration) has restrictions on flying a UAV outdoors. Second, indoor environments have controlled lighting, without wind, rain, or snow. Third, it is easier to ensure that no human appears in the data for the purpose of protecting privacy.
Q: Isn’t recognizing words a mature technology? Why is this a problem worth solving now?
A: Recognizing words from moving cameras using an embedded computer is not trivial.
Q: What is the long-term goal of this competition?
A: Eventually, sophisticated computer vision should run on UAVs directly. UAVs have limited energy, so low-power computing is important.
Q: Why do you use a Raspberry Pi? Why don’t you run on a UAV’s computer directly?
A: The embedded computers of most commercial UAVs are not user-programmable.
Q: If the long-term goal is to run vision programs on UAVs, shouldn't the data be streamed and the computation take as long as the length of the video? Why do you allow solutions that take shorter or longer?
A: The embedded computers of future UAVs may multitask, processing different types of data (in addition to video). Thus, this competition gives contestants the freedom to manage time. If a solution can process video fast, that solution allows the embedded computers to perform other types of computation. This competition does not wish to impose restrictions on how contestants design their solutions.