Tesla’s Software Upgrade Included a Massive Jump in Visual Learning Power

Earlier this month, Tesla Inc. (NASDAQ: TSLA) began updating all its vehicles with version 9 of the company’s automated driving software. Among the new features introduced in the software update is one called Full 360° View that gives drivers “better situational awareness with a 360-degree visualization of surrounding vehicles.”

That may sound easy, but it is not. The version 9 software uses all eight cameras (three front, one back, two on each side) to produce that visualization, where the previous software version used only the three front-facing cameras.

Again, that may not sound like such a big deal, but a member of the Tesla Motors Club who goes by the moniker “jimmy_d,” a self-described Deep Learning Dork, has run the numbers on just how much computing power is required to process that many camera inputs. The result is staggering.

The version 9 neural network that processes all this visual data is, in jimmy_d’s words, “a monster.” A neural network is a machine learning model that learns a task by analyzing training examples, each of which is typically hand-labeled. He explains:

[The version 9] camera network is 10x larger and requires 200x more computation when compared to Google’s Inception [version 1] network from which [version 9] gets its underlying architectural concept. That’s processing *per camera* for the 4 front and back cameras. Side cameras are 1/4 the processing due to being 1/4 as many total pixels. With all 8 cameras being processed in this fashion it’s likely that [version 9] is straining the compute capability of the [Autopilot engine]. …

When you increase the number of parameters (weights) in a [neural network] by a factor of 5 [as Tesla did with version 9], you don’t just get 5 times the capacity and need 5 times as much training data. In terms of expressive capacity increase it’s more akin to a number with 5 times as many digits. So if [version 8’s] expressive capacity was 10, V9’s capacity is more like 100,000. It’s a mind-boggling expansion of raw capacity. And likewise the amount of training data doesn’t go up by a mere 5x. It probably takes at least thousands and perhaps millions of times more data to fully utilize a network that has 5x as many parameters.

This network is far larger than any vision [neural network] I’ve seen publicly disclosed and I’m just reeling at the thought of how much data it must take to train it. I sat on this estimate for a long time because I thought that I must have made a mistake. But going over it again and again I find that it’s not my calculations that were off, it’s my expectations that were off.

Is Tesla using semi-supervised training for [version 9]? They’ve gotta be using more than just labeled data – there aren’t enough humans to label this much data. I think all those simulation designers they hired must have built a machine that generates labeled data for them …

With these new changes the [neural network] should be able to identify every object in every direction at distances up to hundreds of meters and also provide approximate instantaneous relative movement for all of those objects. If you consider the [field of view] overlap of the cameras, virtually all objects will be seen by at least two cameras. That provides the opportunity for downstream processing [to] use multiple perspectives on an object to more precisely localize and identify it.
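To put jimmy_d’s per-camera figures in one place, here is a rough back-of-the-envelope sketch in Python. It takes only the numbers quoted above (200x the computation of Inception version 1 per full-resolution camera, four front and back cameras, four side cameras at one quarter the cost) and assumes, as a simplification, that per-camera costs simply add up. The function and constant names are illustrative, not anything from Tesla’s software, and the result is a relative figure, not a measured one.

```python
# Back-of-the-envelope estimate of version 9's relative vision workload.
# Unit: one forward pass of Google's Inception version 1 network.
# All inputs come from the jimmy_d quote; summing the per-camera costs
# linearly is an assumption made here for illustration.

PER_CAMERA_FACTOR = 200    # quoted: 200x Inception v1 per full-resolution camera
FULL_RES_CAMERAS = 4       # quoted: the "4 front and back cameras"
SIDE_CAMERAS = 4           # two on each side of the car
SIDE_COST_FRACTION = 0.25  # quoted: side cameras have 1/4 as many pixels


def camera_equivalents(full=FULL_RES_CAMERAS, side=SIDE_CAMERAS,
                       side_fraction=SIDE_COST_FRACTION):
    """Express all cameras as a number of full-resolution cameras."""
    return full + side * side_fraction


def total_inception_units(per_camera=PER_CAMERA_FACTOR):
    """Relative compute for one frame from every camera."""
    return per_camera * camera_equivalents()


if __name__ == "__main__":
    print("Full-resolution camera equivalents:", camera_equivalents())
    print("Inception v1 passes per frame set:", total_inception_units())
```

Under these assumptions the eight cameras amount to five full-resolution camera equivalents, or roughly 1,000 Inception version 1 passes for each set of frames, which gives a sense of why jimmy_d suspects the Autopilot hardware is being pushed hard.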

Electrek’s Fred Lambert writes that “Tesla is really doubling down on vision-based autonomous driving and making real progress on that front. … Tesla is now tracking vehicles all around the car and noting the difference between cars, SUVs, trucks, motorcycles, and it even renders pedestrians.”

See the Electrek article for illustrations and more details. jimmy_d’s blog post has more technical details and comments.