Wednesday, January 08, 2014

Tegra K1 64-bit Denver core analysis: Are Nvidia’s x86 efforts hidden within?

Where are the efforts? Show me the efforts...

Tegra K1 (Two Versions)
On January 5th at CES 2014, Nvidia surprised everyone by revealing that its in-house 64-bit ARM core, Project Denver, would arrive this year in the Tegra K1 (Tegra 5/Logan), rather than next year in Tegra 6 (Parker). The Tegra K1 will ship in two flavors: First there’ll be a fairly standard K1 with a quad-core (4+1) Cortex-A15 CPU (just like Tegra 4), but then in the second half of 2014 there’ll be a K1 that features a dual-core 64-bit ARMv8 Denver CPU. Except for the CPU, and the larger caches on the 64-bit version, both variants of the K1 SoC appear to be identical — they both have the monstrous 192-core (1 SMX) Kepler GPU. While the GPU is exciting, and we’ll be discussing its ramifications in due course, it is Nvidia’s 64-bit Denver core that we’ll be looking at in this story.

ARMv8, 7-way superscalar, up to 2.5GHz

Nvidia first mentioned Project Denver way back in 2011, also at CES. At the time, Nvidia teased Denver as some kind of super core that would revolutionize PCs and servers — but curiously, not smartphones and tablets. It would now seem that mobile devices are back on the table, but much will depend on how much power the Denver cores suck down (more on that later). The Denver cores (and the rest of the SoC) are fabricated on TSMC’s 28nm HPM process and will be clocked at up to 2.5GHz. It sounds like both cores will share 128KB of L1 instruction cache and 64KB of L1 data cache.

So far, so good. Much more interesting than clock speeds and caches, though, is Denver’s support for the 64-bit ARMv8 instruction set and an insanely wide “7-way superscalar” architecture. Superscalar, in computing terms, is a kind of CPU architecture that allows for instruction-level parallelism — that is, it can carry out multiple instructions in a single clock cycle. A simple superscalar processor might be capable of fetching and decoding two instructions per clock cycle. To do this, the processor needs to have multiple units that are capable of fetching/decoding/executing/etc simultaneously.

When Nvidia says that each Denver core is 7-way superscalar, it means that it has the hardware resources to execute up to seven instructions per clock cycle. Nvidia hasn’t said exactly what those hardware resources are (if it can decode seven instructions per cycle we’d be stunned), but it’s pretty clear at this point that Team Green has built an absolutely monstrous chip that should be capable of impressive performance. Maybe Nvidia’s claim that Denver is a “Super Core” isn’t just marketing fluff?

Such performance comes at a cost, though — both in terms of power consumption and die size. We don’t have an exact die size yet, but the Denver core is going to be huge. Considering the two Tegra K1 variants are going to be pin-compatible, and judging by the slides published by Nvidia, a Denver core appears to be roughly 2x the size of a Cortex-A15 core — which is itself 3-4x larger than a Cortex-A9 core. Add 192 GPU cores, a memory controller, and all the other bits and pieces, and the Tegra K1 is going to be a very large chip. In terms of power consumption, seven-way instruction-level parallelism is going to be very costly, too.

Tegra K1 die shot (stylized). This is the Cortex-A15 version (4+1 cores), but it’s so pretty that we’re including it anyway.

Is Denver the reincarnation of Nvidia’s x86 efforts?

If all that wasn’t exciting enough, there’s an interesting theory — proposed by Charlie Demerjian and seconded by AnandTech — that Denver is actually a reincarnation of Nvidia’s plans to build an x86 CPU, which was ongoing in the mid-2000s but never made it to market. To get around x86 licensing issues, Nvidia’s chip would essentially use a software abstraction layer to catch incoming x86 machine code (from the operating system and your apps) and convert/morph it into instructions that can be understood by the underlying hardware. This isn’t an entirely new idea: Transmeta tried and failed at it with its Crusoe and Efficeon CPUs.

In this case, of course, the abstraction layer would catch ARMv8 machine code, rather than x86. Furthermore, if you take that abstraction layer and insert a lot of scheduling and parallelism intelligence, you can correspondingly simplify the hardware, which reduces the die size and power consumption. The 7-way superscalar pipeline would also make more sense in such a setup.

If Denver really is a funky code-morphing/emulating CPU, then the 64-bit version of the Tegra K1 could be a very interesting chip indeed. Given the size of the die and the (expected) complexity of the Denver core, Nvidia will have to do something truly magical (such as a really efficient abstraction layer) to make it fit into a smartphone or tablet’s power envelope. In reality, as Nvidia hasn’t yet specified what market it will target with the Denver-powered Tegra K1, the company itself is probably still carrying out lots of testing and optimization to work out whether it’s a mobile chip or a server chip.

The Denver-based Tegra K1 is expected to hit the market in the second half of 2014. Let’s hope it’s as exciting as it sounds, and not just another high-performance power hog — it’s easy to build those.

Credit: Sebastian Anthony
Source: ExtremeTech

Your VB Kid
