After a while the buzzwords start to ring hollow. What’s “artificial intelligence,” in practical terms? An Orwellian nightmare that will control our every battlefield maneuver? Or a helpful tool to aid the war fighter?
Let’s bring it down to Earth, make it tangible. AI can, for instance, scan a live video feed faster and more accurately than any human and then warn commanders of imminent danger.
At least that’s the premise behind an ongoing project at the Air Force Research Laboratory at Wright-Patterson Air Force Base. “Video is captured with such velocity and volume [that] no individual or team of individuals can hope to analyze that data in a meaningful way,” said Scott Clouse, senior research engineer at the Decision Science Branch.
AI offers a faster, smarter alternative.
While video intelligence can be an invaluable resource, it’s also a massive manpower suck. “In some instances, we are up to teams of 30 people looking at one video feed, just to make sure we don’t miss anything,” Clouse said.
“There is an immediate need to cut down the number of people on a single feed, maybe even to the point where we could have a single person looking at multiple feeds. It could dramatically reduce the workload on the force.”
In late 2017 the AFRL research team hit a milestone in its work, winning the Large-Scale Movie Description Challenge at the 2017 International Conference on Computer Vision in Venice, Italy. In that competition, AI-driven systems were tasked with creating simple written descriptions of short clips taken from commercial film footage.
The techniques used here could in principle serve as the basis for an AI-driven situational awareness tool. While the team kept its movie captions deliberately terse (“Someone looks up, someone reads a letter”), video interpretation on the battlefield could perhaps offer an even deeper dive into video intelligence.
“On the battlefield you are not competing with the other audio, you are not competing with music in the background. You could deliver a more lengthy verbal description,” said Vincent Velten, AFRL’s Multi-Domain Sensing Autonomy Division Decision Science Branch technical advisor.
The time factor
For AI to interpret video data, the system must be told what to look for. Right now, programmers can set a simple system to identify specific shapes or colors or types of activity. Moving forward, they want the AI-driven system to cull more detailed information. In technical terms, the mechanism that enables this is called a recurrent neural net, or RNN.
The RNN adds the memory component to the process. “This is what gives you time, and time is what gives you context,” Clouse said.
“You see a person standing next to the truck and then you see a person sitting in the truck. You can intuit that person got into the truck. You start to see relationships in the sequences.”
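To make the memory idea concrete, here is a minimal sketch of a single recurrent update in Python. It is not the AFRL system: the two-number frame features, the weights and the function names are all hypothetical, chosen only to show that a recurrent network's state after the final frame depends on the frames that came before it.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One Elman-style recurrent update: h = tanh(W_x x + W_h h_prev + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w_x[i][j] * x[j] for j in range(len(x)))
        total += sum(w_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_sequence(frames, w_x, w_h, b):
    """Feed frames in order, carrying the hidden state (the memory) forward."""
    h = [0.0] * len(b)
    for x in frames:
        h = rnn_step(x, h, w_x, w_h, b)
    return h

# Hypothetical per-frame features: [person_beside_truck, person_in_truck].
BESIDE = [1.0, 0.0]
INSIDE = [0.0, 1.0]

# Arbitrary fixed weights for illustration; a real system would learn these.
W_X = [[0.5, -0.3], [0.2, 0.8]]
W_H = [[0.9, 0.1], [-0.4, 0.6]]
B = [0.0, 0.0]

# Both sequences end on the same frame, but the histories differ.
got_in = run_sequence([BESIDE, INSIDE], W_X, W_H, B)
already_in = run_sequence([INSIDE, INSIDE], W_X, W_H, B)
print(got_in != already_in)
```

Both sequences end with the person in the truck, yet the final states differ because the earlier frames were different. That carried-over state is what would let a trained system distinguish “got into the truck” from “was already sitting in the truck.”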
There’s some urgency to this work, as the military comes to rely ever more heavily on video capture as a situational awareness tool. Velten pointed especially to the Air Force’s use of video feeds from remotely piloted aircraft.
“The Air Force makes a lot of use of this stuff, and then there are also the small UAVs that are becoming more and more interesting. That is right now the principal motivation,” he said.
The AFRL team needs about five more years to produce a battlefield-worthy version of its video scanning AI tool. To get to the finish line, researchers need to spend more time looking at actual intel and tackling specific military objectives.
Just as last year’s movie competition had a specific tactical goal (caption five seconds of a movie), the researchers need to build their tools around specific ISR objectives. The AI is only as good as what you tell it to do, and they’re still refining the process of writing those instructions for the machines to follow.
“We have a lot of data right now. What we need are concrete objectives to train the system,” Clouse said.
“We need an operational setting where we have some data that is labeled or captioned appropriately that we can feed into the training mechanism, in order to train the systems on what to look for.”
If it works, an AI-driven system could make it easier to pull the most important information from a video feed.
Watching video takes time, far more time than it would take to read a simple sentence or two that sums up the relevant action. “We want to efficiently translate all this video information into a semantic form that is efficient for people to use,” Clouse said.
“Fundamentally it means that you know what is going on and you know what has changed, without having to stare at every frame as it goes by.”