A deep dive into Microsoft’s XBox One architecture

Hot Chips 25: The system, CPUs, and main memory under the microscope

Microsoft gave a talk about the XBox One architecture at Hot Chips 25, and while they shed light on much of the design, a fair bit was left out. Let’s take a look at what they did cover and a little of what they did not.

Console Designs:

As SemiAccurate wrote last Monday, the XBox One architecture is very similar to AMD’s Trinity and Kabini APUs in overarching system design. Microsoft did not just take what existed and wire it to their controllers; there were some pretty massive changes to the uncore and south bridge that result in a unique device. Let’s dig in to the details.

XBox One system layout including the SoC, South Bridge, and peripherals

First iterations of most consoles tend to be multi-chip devices for cost reasons. The philosophy is to make the two biggest chips you can afford, then shrink them a few times to drop costs. Traditionally this means two ~300+mm^2 devices which are shrunk to two smaller devices when the next process becomes affordable. If costs are amenable, the next shrink then combines them into one SoC smaller than the two older chips combined. Since the two biggest components of any system are the CPU and GPU, that has meant one die for each.

South Bridge:

XBox One (XBO) is a bit different because it is based on a PC architecture, AMD’s APU to be more specific. That means the CPU and GPU are tightly coupled and on one die, partly because of functional necessity in a modern system and partly because they can actually fit on one chip now. The main SoC holds the overwhelming majority of the functionality while the second chip is effectively a Microsoft designed South Bridge. These devices are usually pin-bound and not very performance sensitive, so they are made on a -1 or -2 process, meaning one or two nodes behind the leading edge.

If you look at the functionality it contains (three USB3 ports, two SATA ports, PCIe, some video I/Os, and an eMMC controller), there isn’t really all that much to it. This chip is mainly a boot ROM/flash host plus some lower speed I/O controllers that handle different voltages in a more cost efficient manner than a high-speed 28nm SoC could. Everything that really matters is on the main SoC.

The main XBox One SoC block diagram

Eight CPUs and More:

As you can see there is a lot on this chip. There are 8 AMD Jaguar cores organized into two blocks of four, and each block has a 2MB 16-way L2 cache shared among its four CPUs. As you might expect, the two L2s are coherent but not directly addressable by CPUs in the other group of four. Each Jaguar core has 32K of 2-way L1 I$ and 32K of 8-way L1 D$ as well. AMD’s Jaguar only goes up to four cores, so Microsoft had to come up with a mechanism to ensure coherency both between the two clusters of four CPUs and with the rest of the system, including the GPUs. This is where a lot of what Microsoft added to the system came in.
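To make the cache numbers above a bit more concrete, here is a minimal sketch of the set counts they imply. The capacities and associativities come from the text; the 64-byte line size is an assumption that was not part of the talk summary, and the helper name `num_sets` is purely illustrative.

```python
# Cache geometry sketch for the figures quoted above.
# ASSUMPTION: 64-byte cache lines (not stated in the article).
LINE_BYTES = 64
KIB = 1024
MIB = 1024 * KIB


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Sets = capacity / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)


l1i_sets = num_sets(32 * KIB, 2)   # 32K, 2-way L1 I$ per core
l1d_sets = num_sets(32 * KIB, 8)   # 32K, 8-way L1 D$ per core
l2_sets = num_sets(2 * MIB, 16)    # 2MB, 16-way L2 per four-core block
total_l2_mib = 2 * 2               # two blocks of four cores, 2MB each

print(f"L1 I$ sets per core: {l1i_sets}")
print(f"L1 D$ sets per core: {l1d_sets}")
print(f"L2 sets per cluster: {l2_sets}")
print(f"Total L2 on the SoC: {total_l2_mib} MB")
```

If the line size is different, only the set counts change; the 4MB total of L2 across the two clusters follows directly from the text.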

The most notable change here was that the data paths between the CPUs and the North Bridge/system fabric were massively beefed up. The blue arrows above are for coherent memory accesses and yellow for non-coherent traffic; all the major blocks are coherent with each other. If you think about the sheer volume of coherency data that needs to go between the two CPU blocks, Microsoft probably had to beef up the L2 to NB links to almost match that of the L1 to L2 links. While specifics were not given out, SemiAccurate was told it was “significantly wider” along with beefed up buffers and deeper queues. Don’t discount this as a minor change; it is both critical to the system performance and a very complex thing to do. It will be interesting to see how Sony did their variant if they ever give a talk on the PS4 architecture.

Main Memory Speeds, Feeds, and Coherency:

The CPUs connect to four 64b wide 2GB DDR3-2133 channels for a grand total of 68GB/sec bandwidth. Do note that this number exactly matches the width of a single on-die memory block. One interesting thing to note is that the speed of the CPU MMU’s coherent link to the DRAM controller is only 30GB/sec, something that strongly suggests that Microsoft sticks with Jaguar’s half-clock speed NB. If the NB to DRAM controller link is 256b/32B wide, that would mean it runs at about 938MHz, or 1.88GHz if it is 128b/16B wide.
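The arithmetic behind those figures can be sketched as follows. This is a back-of-the-envelope check, not anything Microsoft disclosed: it assumes the coherent link moves one full-width transfer per clock and uses decimal gigabytes, and the function names are illustrative only.

```python
def dram_bandwidth_gbs(channels: int, width_bits: int, mega_transfers: float) -> float:
    """Peak DRAM bandwidth in GB/sec (decimal) for the given channel setup."""
    bytes_per_transfer = width_bits / 8
    return channels * bytes_per_transfer * mega_transfers * 1e6 / 1e9


def link_clock_mhz(bandwidth_gbs: float, width_bytes: int) -> float:
    """Clock needed to hit a bandwidth, ASSUMING one transfer per clock."""
    return bandwidth_gbs * 1e9 / width_bytes / 1e6


# Four 64b DDR3-2133 channels, as stated in the text.
print(f"DRAM peak: {dram_bandwidth_gbs(4, 64, 2133):.1f} GB/sec")

# 30GB/sec coherent link at the two candidate widths.
for width_bytes in (32, 16):
    clock = link_clock_mhz(30, width_bytes)
    print(f"{width_bytes * 8}b link: {clock:.1f} MHz")
```

The 256b case works out to 937.5MHz and the 128b case to 1875MHz, which is where the rounded 938MHz and 1.88GHz numbers come from.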

SemiAccurate would be very surprised if it was 128b wide; wires are cheap, power saving areas are not. Why is this important? Unless Microsoft’s XBox One architects are masochists who enjoy doing needless and annoying work, they would not have reinvented the wheel and put an arbitrarily clockable asynchronous interface between the NB and the CPU cores/L2s. Added complexity, lowered performance, and a die penalty for absolutely no useful upside do not make for a good architectural decision. With a ~938MHz NB running at half the core clock, that means the XBox One’s 8 Jaguar cores are clocked at ~1.9GHz, something that wasn’t announced at Hot Chips. Now you know.
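The inference above is just a ratio, sketched here for both candidate link widths. It rests on the article's assumption that the NB runs at exactly half the core clock; the helper `core_clock_ghz` and its default ratio are illustrative, not a disclosed specification.

```python
def core_clock_ghz(nb_clock_mhz: float, core_to_nb_ratio: float = 2.0) -> float:
    """Core clock implied by an NB clock, ASSUMING a fixed core:NB ratio."""
    return nb_clock_mhz * core_to_nb_ratio / 1000


# NB clocks implied by a 30GB/sec link at 256b and 128b widths.
for label, nb_mhz in (("256b link", 937.5), ("128b link", 1875.0)):
    print(f"{label}: NB {nb_mhz} MHz -> cores {core_clock_ghz(nb_mhz)} GHz")
```

Under the half-clock assumption, the 256b case gives 1.875GHz, which rounds to the ~1.9GHz estimate in the text.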

The CPU NB also has coherent links to the GPU MMU and I/O MMU, something you would expect on any system that takes GPU compute work seriously. AMD has their HSA/hUMA architecture coming with Kaveri in short order, but XBO is based on a design ~1+ generations older, so there is no advanced AMD CPU/GPU coherency here. Luckily Microsoft is on the ball and put in their own mechanism, which they unfortunately would not go into detail on. What SemiAccurate has heard about it says they did a pretty impressive job, but until it is fully disclosed we can’t comment with authority. Let’s just leave things at, “From what we can tell it looks good”.

Another thing to notice is a rather odd direct and coherent path between the AV In block in the GPU/accelerator area and the Audio DMA unit in the I/O area. The Audio DMA unit also has a direct link to the Audio Out/Resize/Compositing block; both of these links are one way. Since one is an inbound unit and the other is an outbound unit that kind of makes sense, and if there is a need to go the other way the two can talk via the CPU MMU. While this may not make much sense on the surface, much of the XBO’s audio functionality is devoted to processing the Kinect’s data stream, so high bandwidth and low latency are kind of necessary. More on this later. S|A

Note: This is the first part and only covers part of the system. More, including the GPU, embedded memory, and audio systems, is to come.

Charlie Demerjian

Roving engine of chaos and snide remarks at SemiAccurate
Charlie Demerjian is the founder of Stone Arch Networking Services and SemiAccurate.com. SemiAccurate.com is a technology news site addressing hardware design, software selection, customization, securing and maintenance, with over one million views per month. He is a technologist and analyst specializing in semiconductors, system and network architecture. As head writer of SemiAccurate.com, he regularly advises writers, analysts, and industry executives on technical matters and long lead industry trends. Charlie is also available through Guidepoint and Mosaic.