Millions of people communicate using sign language, but so far projects to capture its complex gestures and translate them to verbal speech have had limited success. A new advance in real-time hand tracking from Google’s AI labs, however, could be the breakthrough some have been waiting for.
The new technique uses a few clever shortcuts and of course the increasing general efficiency of machine learning systems to produce, in real time, a highly accurate map of the hand and all its fingers, using nothing but a smartphone and its camera.
“Whereas current state-of-the-art approaches rely primarily on powerful desktop environments for inference, our method achieves real-time performance on a mobile phone, and even scales to multiple hands,” write Google researchers Valentin Bazarevsky and Fan Zhang in a blog post. “Robust real-time hand perception is a decidedly challenging computer vision task, as hands often occlude themselves or each other (e.g. finger/palm occlusions and hand shakes) and lack high contrast patterns.”
Not only that, but hand movements are often quick, subtle, or both, which is not necessarily the kind of thing that computers are good at catching in real time. Basically it’s just super hard to do right, and doing it right is hard to do fast. Even multi-camera, depth-sensing rigs like those used by SignAll have trouble tracking every movement. (But that isn’t stopping them.)
The researchers’ aim in this case, at least partly, was to cut down on the amount of data that the algorithms needed to sift through. Less data means quicker turnaround.
For one thing, they abandoned the idea of having a system detect the position and size of the whole hand. Instead, they only have the system find the palm, which is not only the most distinctive and reliably shaped part of the hand, but is square to boot, meaning they didn’t have to worry about the system being able to handle tall rectangular images, short ones, and so on.
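To see why a square target lightens the load, consider a detector that proposes candidate boxes ("anchors") at every cell of a grid. The sketch below is only an illustration of that idea, not the researchers' code: the grid size, the aspect ratios, and the `make_anchors` helper are all assumptions chosen for the example.

```python
from itertools import product


def make_anchors(grid_size, aspect_ratios):
    """Return (cx, cy, w, h) candidate boxes in normalized [0, 1] coordinates.

    Hypothetical helper: one box per grid cell per aspect ratio (width / height).
    """
    cell = 1.0 / grid_size
    anchors = []
    for row, col in product(range(grid_size), repeat=2):
        cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
        for ratio in aspect_ratios:
            # Keep each box's area equal to one grid cell.
            w = cell * ratio ** 0.5
            h = cell / ratio ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors


# Square palms need a single shape per cell...
square_only = make_anchors(grid_size=8, aspect_ratios=[1.0])
# ...whereas whole hands would need tall and wide variants as well.
mixed_shapes = make_anchors(grid_size=8, aspect_ratios=[0.5, 1.0, 2.0])

print(len(square_only), len(mixed_shapes))  # 64 192
```

With these made-up numbers, the square-only detector has a third as many candidates to score per frame, which is the kind of saving the palm-first approach is after.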
Once the palm is recognized, of course, the fingers sprout out of one end of it and can be analyzed separately. A separate algorithm looks at the image and assigns it 21 coordinates, roughly corresponding to knuckles and fingertips, including an estimate of how far away each one likely is (it can guess based on the size and angle of the palm, among other things).
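One way to picture those 21 points is as a wrist point plus four points along each of the five fingers. The sketch below assumes that layout; the `Landmark` class, the point names, and the placeholder coordinates are illustrative, not taken from the researchers' code.

```python
from dataclasses import dataclass

FINGERS = ["thumb", "index", "middle", "ring", "pinky"]
POINTS_PER_FINGER = 4  # assumed: three joints plus the fingertip


@dataclass
class Landmark:
    name: str
    x: float  # horizontal position in the image
    y: float  # vertical position in the image
    z: float  # estimated depth (how far away the point likely is)


def landmark_names():
    """Build the assumed point order: wrist first, then each finger base to tip."""
    names = ["wrist"]
    for finger in FINGERS:
        for i in range(POINTS_PER_FINGER):
            label = "tip" if i == POINTS_PER_FINGER - 1 else f"joint{i + 1}"
            names.append(f"{finger}_{label}")
    return names


NAMES = landmark_names()
# Placeholder coordinates; a real model would fill these in per frame.
hand = [Landmark(name, 0.0, 0.0, 0.0) for name in NAMES]

print(len(hand), NAMES[4], NAMES[8])  # 21 thumb_tip index_tip
```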
To do this finger recognition part, they first had to manually add those 21 points to some 30,000 images of hands in various poses and lighting situations, for the machine learning system to ingest and learn from. As usual, artificial intelligence relies on hard human work to get going.
Once the pose of the hand is determined, that pose is compared to a bunch of known gestures, from sign language symbols for letters and numbers to things like “peace” and “metal.”
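A minimal sketch of that comparison step might reduce the pose to "which fingers are extended" and look the result up in a table. Everything here is an assumption for illustration: the point indices (wrist at 0, four points per finger), the distance-based test for an extended finger, and the small `GESTURES` table are not the researchers' implementation.

```python
import math

WRIST = 0
# (middle joint index, fingertip index) per finger, assuming wrist + 4 points each.
FINGER_POINTS = {
    "thumb": (2, 4),
    "index": (6, 8),
    "middle": (10, 12),
    "ring": (14, 16),
    "pinky": (18, 20),
}

# Hypothetical gesture table keyed by the set of extended fingers.
GESTURES = {
    frozenset(): "fist",
    frozenset({"index", "middle"}): "peace",
    frozenset({"index", "pinky"}): "metal",
    frozenset(FINGER_POINTS): "open hand",
}


def extended_fingers(points):
    """Treat a finger as extended if its tip is farther from the wrist than its middle joint."""
    wrist = points[WRIST]
    result = set()
    for finger, (mid, tip) in FINGER_POINTS.items():
        if math.dist(points[tip], wrist) > math.dist(points[mid], wrist):
            result.add(finger)
    return frozenset(result)


def classify(points):
    return GESTURES.get(extended_fingers(points), "unknown")


def synthetic_hand(extended):
    """Build 21 made-up 2D points: straight fingers go up, curled ones fold back."""
    points = [(0.0, 0.0)]  # wrist
    for x, finger in enumerate(FINGER_POINTS):
        heights = (1.0, 2.0, 3.0, 4.0) if finger in extended else (1.0, 2.0, 1.5, 1.0)
        points.extend((float(x), y) for y in heights)
    return points


print(classify(synthetic_hand({"index", "middle"})))  # peace
print(classify(synthetic_hand({"index", "pinky"})))   # metal
```

A lookup like this only covers static poses; signs that involve motion would need something that looks at sequences of poses over time.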
The result is a hand-tracking algorithm that’s both fast and accurate, and runs on a normal smartphone rather than a tricked-out desktop or the cloud (i.e. someone else’s tricked-out desktop). It all runs within the MediaPipe framework, which multimedia tech people may already know something about.
With luck, other researchers will be able to take this and run with it, perhaps improving existing systems that have needed beefier hardware to do the hand tracking that gesture recognition depends on. It’s a long way from here to really understanding sign language, though, which uses both hands, facial expressions, and other cues to produce a rich mode of communication unlike any other.
This isn’t being used in any Google products yet, so the researchers were free to give their work away. The source code is available for anyone to take and build on.
“We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues,” they write.