Despite a slew of artificial intelligence processors poised to reach the market, each boasting its own “breakthrough,” myriad problems continue to dog today’s AI community, ranging from the energy, speed, and size of AI hardware to AI algorithms that have yet to demonstrate improvements in robustness and performance.
In computer vision, the biggest challenge is how to “make visual analysis more efficient,” Rogerio Feris, research manager for computer vision and multimedia at IBM Research, told EE Times.
To be clear, AI is still in an early phase. It needs fresh ideas, a long-term vision, and more heavy lifting in R&D by academia and research institutes.
IBM Research is coming to Salt Lake City this week to present two technical papers — on AI hardware and software — at the 2018 Conference on Computer Vision and Pattern Recognition (CVPR), sponsored by the Computer Vision Foundation and IEEE Computer Society. IBM called CVPR one of the most competitive computer-vision conferences.
For AI hardware, IBM Research is promoting a stereo-vision system developed by applying brain-inspired spiking neural-network technology to both data acquisition (sensors) and data processing. IBM’s stereo-vision design leverages the company’s own TrueNorth chips — a non-von-Neumann-based processor — and a pair of event-driven cameras developed by iniLabs (Switzerland).
For AI software, IBM Research is unveiling a paper on “Blockdrop,” a step deemed critical to reduce the total computation required by deep residual networks.
Both papers attack the same problem — efficiency in visual analysis — from two different angles, explained Feris.
‘BlockDrop’ for residual networks
If someone is crossing a road, an autonomous vehicle (AV) is expected to make “real-time inferences,” noted the IBM research manager. Image-recognition accuracy is critical, but the time the system takes to reach that conclusion is the ultimate test in real-world applications.
The residual network, for example, took the computer-vision community by storm as the 2015 winner of the ImageNet competition. It has proven to offer excellent recognition results because it can train hundreds or even thousands of layers in neural networks. “But applying such a one-size-fits-all computation [required by the residual network] to all [images] is too inefficient,” according to Feris. The image of a dog on a clean white background, for example, should be infinitely easier to recognize than a dog in a busy cityscape.
With this in mind, IBM Research developed BlockDrop, an approach that “learns to dynamically choose which blocks — multiple layers — in the residual network to execute during inference,” explained Feris. “The goal is to best reduce total computation without degrading prediction accuracy.”
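The idea can be sketched in a few lines of NumPy. This is an illustrative toy, not IBM’s implementation: `residual_block`, `policy`, and all weights here are made-up stand-ins, and the real BlockDrop policy network is trained jointly with the residual network to preserve accuracy (the paper uses a reinforcement-learning-style reward) rather than initialized randomly:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, w):
    # One residual block: identity shortcut plus a learned transform.
    return x + np.tanh(x @ w)

def policy(x, w_policy):
    # Lightweight policy: per-block keep probabilities computed
    # once from the input, before any block runs.
    logits = x @ w_policy
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

def blockdrop_forward(x, blocks, w_policy, threshold=0.5):
    # Execute only the blocks the policy keeps; a skipped block
    # degenerates to the identity mapping, saving its computation.
    keep_probs = policy(x, w_policy)
    executed = 0
    for i, w in enumerate(blocks):
        if keep_probs[i] >= threshold:
            x = residual_block(x, w)
            executed += 1
    return x, executed

dim, n_blocks = 8, 10
blocks = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(n_blocks)]
w_policy = rng.normal(scale=0.5, size=(dim, n_blocks))
x = rng.normal(size=dim)

out, executed = blockdrop_forward(x, blocks, w_policy)
print(executed, "of", n_blocks, "blocks executed")
```

Because the keep decisions depend on the input, an easy image (the dog on a white background) can exit through far fewer blocks than a cluttered one, which is where the average speed-up comes from.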
IBM claims that in testing, BlockDrop increased recognition speed by 20% on average and sometimes accelerated the process by as much as 36% without sacrificing the accuracy that the residual network achieved on the ImageNet dataset. IBM’s research work, begun in the summer of 2017, was a collaboration with teams at the University of Texas and the University of Maryland, according to Feris. IBM will make the source code for BlockDrop available to the open-source community.
Neuromorphic technology for stereo vision
On the hardware side, IBM Research has zeroed in on a stereo-vision system that uses a spiking neural network. Until now, the industry has used two conventional [frame-based] cameras to produce stereoscopic vision, but nobody has tried neuromorphic technology, according to IBM.
While it’s not impossible to get stereo images with conventional cameras, doing so requires high-quality image-signal pipelines for high-dynamic-range (HDR) imaging, Ultra HD processing, and automatic calibration of the stereo camera.
In its paper, “A Low Power, High Throughput, Fully Event-Based Stereo System,” IBM describes how two event-driven cameras (also known as Dynamic Vision Sensors, or DVSes) developed by iniLabs capture scenes while a cluster of IBM TrueNorth chips extracts the depth of rapidly moving objects, according to Alexander Andreopoulos, research scientist for brain-inspired computing at IBM Research.
IBM’s goal is to dramatically reduce power and latency when obtaining a stereoscopic image. After receiving a live-streaming spiking input (which already substantially reduces incoming data), the system uses IBM’s neuromorphic hardware to reconstruct a 3D image by estimating disparities between two images from two DVSes and locating the object in 3D space by triangulating the data.
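The final triangulation step rests on a standard rectified-stereo relation: depth = focal length × baseline / disparity. The sketch below is that textbook formula in plain Python, not IBM’s spiking implementation, and the camera numbers are purely illustrative:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    # Rectified stereo: a point's depth is inversely proportional
    # to its disparity (the horizontal pixel shift between the
    # left and right views).
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Example: 20-px disparity, 200-px focal length, 10-cm baseline.
print(depth_from_disparity(20.0, 200.0, 0.10))  # → 1.0 (meters)
```

The hard part, and what the TrueNorth cluster actually computes, is estimating the disparity itself by matching events between the two sensors; once a disparity is known, the depth falls out of this one-line formula.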
Data acquisition & data processing
Although IBM’s Andreopoulos said that he is not familiar with it, Prophesee, a startup based in Paris, is already applying neuromorphic technology to data acquisition to reduce the data collected by sensors.
Prophesee’s sensor technology is not frame-based. It is designed to acquire data that is simplified and tailored for machine use. This much-reduced data load should allow cars to make almost real-time decisions, Prophesee’s CEO, Luca Verre, told EE Times in previous interviews.
What is new in IBM’s stereo system, however, is that it applies brain-inspired technology not only to data acquisition but also to data processing to reconstruct a stereo view. Prophesee, in contrast, focuses on data acquisition.
One of the biggest accomplishments in this system, said Andreopoulos, is that the research team succeeded in “programming its TrueNorth chips” to efficiently run “a diverse set of common sub-routines necessary for stereo vision on a spiking neural network.”
IBM explained that architecture based on TrueNorth chips uses “much less power than conventional systems, which could benefit the design of autonomous mobile systems.”
Similarly, the use of a pair of DVS cameras (non-frame based) that respond only to changes in the scene resulted in “less data, lower energy consumption, high speed, low latency, and good dynamic range, all of which are also key to the design of real-time systems,” according to IBM.
Asked about specific gains demonstrated by the new TrueNorth-based system, Andreopoulos said, “We proved a 200-times improvement in terms of power per pixel disparity map” compared to the closest state-of-the-art system based on conventional processors such as CPUs, GPUs, or FPGAs. With event-based input, the live-feed version of the IBM system running on nine TrueNorth chips is shown to calculate 400 disparity maps per second with up to 11-ms latency. IBM states in its paper that the system can increase this rate to 2,000 disparity maps per second, depending on “certain trade-offs.”
Asked when a stereo-vision system based on IBM’s TrueNorth chips can go commercial, Andreopoulos said, “We can’t even tell you when. We tested it, and we successfully programmed our chips to run disparity map efficiently. This is a proof-of-concept at this stage.”
— Junko Yoshida, Chief International Correspondent, EE Times