Finding Novel Targets on the Fly: Using Advanced AI to Make Flexible Automatic Target Recognition Systems

There’s a common trope in movies where a computer system scans thousands of traffic cameras to find a single vehicle of interest, all in a matter of moments.

Like all good fiction, it’s grounded in reality.   

Automatic target recognition (ATR) systems use advanced machine learning and artificial intelligence (AI) to search for and identify specified targets. These systems work well when they’re properly trained. The challenge is that it takes a long time to train the system on each new target type.

“Historically, if you wanted to change the target or class, you had to get more data, go through the machine learning pipeline, and then develop and deploy a new model onto the platform,” said Anthony Palladino, a principal scientist at Draper.

His work, however, will streamline that process for Draper’s ATR capabilities.

Palladino and fellow Draper researchers are testing whether state-of-the-art vision language models (VLMs) can be used to instantly update, modify, or adapt the set of desired targets in Draper’s ATR systems.

Converting Words to Images

VLMs are an advanced type of multimodal AI that correlates language with images. These models are trained on videos and images paired with corresponding text and spoken words, learning to relate visual content to language.

In simple terms, a traditional AI system trained on images of cars can identify other images of cars, but without any deeper understanding. A VLM, however, connects the words “car,” “wheels,” “windshield,” and so on with the visual features of those concepts, and learns an internal semantic representation of what a car looks like. This means one can ask a VLM to find a “truck” (a semantically similar concept), and it can recognize a picture of a truck even if it has never seen one in its training data.
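To make that concrete, the sketch below shows this kind of zero-shot image-text matching using the publicly available CLIP model from the open-source Hugging Face transformers library. It illustrates the general VLM technique only, not Draper’s system; the image file name is a placeholder.

```python
# Minimal sketch of VLM image-text matching with the open-source CLIP model.
# Illustrative only; this is not Draper's system. "scene.jpg" is a placeholder.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")
labels = ["a photo of a car", "a photo of a truck", "a photo of a bicycle"]

# Score the image against each free-form text label.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)

print("Best match:", labels[probs.argmax().item()])
```

Because the labels are free-form text, swapping in a new target description requires no retraining, which is exactly the flexibility Palladino’s team is exploiting.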

Palladino is specifically using a VLM configured to perform Open Vocabulary Object Detection (OVOD), which uses natural language to describe the target of interest. An OVOD system can identify objects based solely on text descriptions, or by leveraging a few exemplar images. This is particularly useful when the specific target differs only slightly from other targets. For example, in a convoy of trucks, an OVOD system can identify just the vehicle with a unique marking or logo on it.
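The sketch below shows what open-vocabulary detection looks like in practice, using the publicly available OWL-ViT model from the Hugging Face transformers library. It is a stand-in for the general technique, not Draper’s ATR pipeline; the image file and text queries are placeholders.

```python
# Minimal open-vocabulary object detection sketch using the open-source
# OWL-ViT model. Illustrative only; this is not Draper's ATR pipeline.
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("convoy.jpg")  # placeholder aerial image
queries = [["a truck with a logo on its side", "a plain truck"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw model outputs into scored bounding boxes in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(f"{queries[0][label]}: score={score:.2f}, box={box.tolist()}")
```

Retargeting the detector, say from trucks to unexploded ordnance, amounts to editing the query strings rather than collecting data and retraining a model.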

Testing the System in Defense Scenarios

Palladino and his team tested Draper’s VLM-based ATR system in an Office of Naval Research competition to identify unexploded ordnance on airfields. The competition provided imagery collected by a camera-equipped drone scanning the ground from about 8 meters up. They tasked the ATR with identifying multiple types of unexploded ordnance using text descriptions, such as: “A 155mm artillery shell is shaped like a bullet and is olive to brown in color with yellow markings.”

The VLM-powered ATR proved largely successful at identifying most of the ordnance, though it had some difficulty differentiating ordnance with similar visual characteristics, as expected. “The system did reasonably well,” said Palladino. “We were able to use our system with very minimal work to update search parameters, and we didn’t have to use thousands of images to create a new model to identify the new desired targets. This drastically cuts down the time required to apply our ATR system to a new task.”

These results will be published in the proceedings of the 37th annual Innovative Applications of Artificial Intelligence conference (IAAI-25).

Extending the System to Teaming Scenarios

Consider the scenario where a ground teammate (human or robot) is moving through a potentially treacherous environment. An aerial teammate (drone) could watch out for nearby hazards and help the ground teammate navigate safely.

Palladino and his team are extending the VLM-based ATR system to improve situational awareness in such teams, calling it Collaborative Hazard Awareness and Recognition of Terrain (CHART).

The flexibility of the VLM means that one day the system can help a bomb-defusing specialist navigate an airfield, with updates like “look out, you’re approaching a grenade at your 2 o’clock.” The next day, it can help a ground robot navigate an urban environment, helping it avoid tumbling down subway stairs or off a bridge when its forward-facing camera cannot see occluded hazards.

The team demonstrated their CHART teaming prototype at Dragon Spear 2024. The invitation-only event, which is sponsored by United States Special Operations Command, is designed to explore emerging technologies to detect, counter, surveil, and decontaminate chemical, biological, radiological, and nuclear threats in operational scenarios.

User Experience

“We’re making this easy to use,” said Palladino. Non-technical users may not intuitively know the best way to use language to describe a target of interest. What if a word has multiple meanings? What if a concept can be visualized many ways, but the user has only one in mind?

His team is developing a tool that analyzes the user’s initial target description and provides feedback and guidance on how to refine the description, before executing the mission, to improve ATR performance.
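Draper has not published the details of this tool, but as a rough illustration of the idea, the hypothetical sketch below flags potentially ambiguous words in a description by counting their dictionary senses in WordNet, via the open-source NLTK library. The heuristic and threshold are assumptions for illustration, not Draper’s method.

```python
# Hypothetical sketch of description-refinement feedback. The WordNet
# sense-count heuristic is an assumption for illustration; it is not
# Draper's method. Requires: pip install nltk; nltk.download("wordnet")
from nltk.corpus import wordnet

def flag_ambiguous_terms(description: str, max_senses: int = 5) -> list[str]:
    """Suggest refinements for words with many possible meanings."""
    feedback = []
    for word in description.lower().split():
        senses = wordnet.synsets(word)
        if len(senses) > max_senses:
            feedback.append(
                f"'{word}' has {len(senses)} dictionary senses; "
                "consider a more specific term or an exemplar image"
            )
    return feedback

for note in flag_ambiguous_terms("find the tank near the bank"):
    print(note)
```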

These user feedback results will be published in the proceedings of the SPIE Automatic Target Recognition XXXV conference.

Taking Tech into the World

Going forward, Palladino and his team intend to continue refining the technology, enhancing its accuracy and its ability to quickly update nuanced target descriptions on the fly.

The potential applications for instant ATR retargeting are substantial. Soldiers in combat settings could connect directly with a drone or satellite-based ATR to identify specific targets or dangers around a corner, for example.

Emergency responders could use the system to identify a lost child in a crowded amusement park based solely on a description of the child’s height, hair color, and clothing.

And, as in the movies, law enforcement officers could use their on-person radios to update a search of public surveillance camera footage for a known suspect who may have changed his appearance with a hat and sunglasses.

“Almost any computer vision task could be enhanced by leveraging these powerful VLMs, and we’re excited to continue expanding our capabilities,” said Palladino.