Sunday, September 23, 2007

Block Level Diagrams and Bandwidth Analysis

If you truly want to learn about the potential of any given server system, I believe that it is critical to spend time analyzing a block level diagram of the board or chipset layout. Many times, these diagrams can give you a pretty good or sometimes even an exact figure for the theoretical bandwidth of a given system.

So, how can we benefit from this analysis?

It's pretty easy, actually.

First, start with CPU-to-memory bandwidth. A well-thought-out block diagram will show the theoretical bandwidth from CPU to memory. It's also important to verify whether the system is laid out with a single front side bus or multiple front side busses, or whether it uses a front side bus architecture at all.

Systems with AMD processors have memory controllers integrated into their CPUs, which allows each CPU to be directly attached to memory. Technologies such as node and bank interleaving, as well as NUMA (non-uniform memory access), can allow for extraordinary aggregate memory performance on systems with CPU-integrated memory controllers. The lack of an integrated memory controller is one of the key weaknesses in Intel x86-64 processors, but Intel claims that it will have this technology shipping in the next several months.
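To make the "aggregate memory performance" idea concrete, here is a rough sketch of the arithmetic. The socket count, channel count, and DDR2-667 transfer rate are hypothetical values chosen for illustration, not figures from any particular board, and the function names are my own. The aggregate figure also assumes every CPU is working out of its own locally attached memory.

```python
def channel_bandwidth_gbs(mega_transfers_per_sec, bus_width_bits=64):
    """Peak bandwidth of one memory channel in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9


def system_memory_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_sec):
    """Aggregate peak bandwidth when each socket has its own memory controller."""
    per_socket = channels_per_socket * channel_bandwidth_gbs(mega_transfers_per_sec)
    return sockets * per_socket


# Hypothetical example: 2 sockets, 2 channels per socket, DDR2-667 (667 MT/s).
per_channel = channel_bandwidth_gbs(667)
aggregate = system_memory_bandwidth_gbs(sockets=2, channels_per_socket=2,
                                        mega_transfers_per_sec=667)
print(f"per channel: {per_channel:.2f} GB/s, aggregate: {aggregate:.2f} GB/s")
```

Read the real channel count and memory speed off your own block diagram and substitute them in.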

Once you figure out CPU-to-memory throughput, you then need to analyze the block diagram and see how much bandwidth is available from the main chipset chips to the embedded and add-in peripheral slots (e.g., PCI Express). For example, your system might have four x16 PCI Express slots (64 lanes), but only an aggregate of 32 lanes running from those four slots to the primary or secondary chipset controller. Therefore, your theoretical bandwidth out of these four slots is 32 lanes of PCI Express (16 GB/s, counting both directions) instead of the expected 64 lanes (32 GB/s).
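The slot example works out as follows. This is a minimal sketch that assumes first-generation PCI Express, which carries 250 MB/s per lane in each direction; the function and variable names are my own.

```python
# PCI Express 1.x: 2.5 GT/s per lane with 8b/10b encoding = 250 MB/s per direction.
PCIE1_MB_PER_LANE_PER_DIRECTION = 250


def pcie_bandwidth_gbs(lanes, both_directions=True):
    """Theoretical PCIe 1.x bandwidth in GB/s for a given lane count."""
    per_direction_mb = lanes * PCIE1_MB_PER_LANE_PER_DIRECTION
    total_mb = per_direction_mb * (2 if both_directions else 1)
    return total_mb / 1000


slot_lanes = 4 * 16      # four x16 slots as seen from the outside
uplink_lanes = 32        # lanes actually wired through to the chipset
usable_lanes = min(slot_lanes, uplink_lanes)

print(pcie_bandwidth_gbs(slot_lanes))    # 32.0 GB/s expected from the slot count
print(pcie_bandwidth_gbs(usable_lanes))  # 16.0 GB/s the layout can deliver
```

The narrowest link between the slots and the chipset sets the ceiling, no matter how wide the slots themselves are.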

Oddly enough, many systems are laid out in a very inefficient manner here. It is not uncommon to see one or two Gigabit Ethernet ports sitting on a single 32-bit PCI bus. Since two Gig-E ports can theoretically transfer 500 MB/s in aggregate at full duplex, this creates a major bottleneck on a 32-bit PCI bus whose theoretical rate is approximately 132 MB/s. This is why I often hear people complain about the performance of embedded network controllers. Add to this the fact that many chipsets and boards attach extra peripherals to these busses, and you have a recipe for a low-performance nightmare.
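Here is the same comparison as a quick calculation. It is a sketch that assumes a 32-bit PCI bus clocked at 33 MHz, which is what the 132 MB/s figure implies, and the helper names are hypothetical.

```python
def gige_mb_per_sec(ports, full_duplex=True):
    """Theoretical Gigabit Ethernet throughput in MB/s across all ports."""
    per_direction = 1000 / 8  # 1 Gb/s = 125 MB/s in each direction
    return ports * per_direction * (2 if full_duplex else 1)


def pci_mb_per_sec(width_bits=32, clock_mhz=33):
    """Theoretical peak of a parallel PCI bus, shared by every device on it."""
    return width_bits / 8 * clock_mhz


demand = gige_mb_per_sec(2)   # 500.0 MB/s wanted by two full-duplex ports
supply = pci_mb_per_sec()     # 132.0 MB/s offered by the bus
print(f"oversubscribed by a factor of {demand / supply:.1f}")
```

Every extra peripheral hanging off the same bus eats into that 132 MB/s, so the real ratio is usually worse.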

So a quick recap of how to utilize a block level diagram to gauge best case performance:

1) Figure out best-case CPU-to-memory bandwidth. You will not realistically come close to these numbers using embedded or add-in peripherals, but it does tell you the best-case theoretical performance.

2) As a side note to item #1, I always like to run CPU-to-memory bandwidth tests to see the realistic throughput of a system. Many times, you'll be shocked to see that it's far below theoretical expectations. Once again, this number is your high-water mark, so don't expect external bandwidth at these rates.

3) Verify any bottlenecks between the embedded or add-in peripherals and the chipset interfaces. Many times, you'll find that board layout will keep you far below expected rates.

4) And finally, run benchmarks whenever possible. The block level diagram is only a guide to best-case results. There may still be any number of hidden or underlying problems that keep your system from achieving expected results.
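For the measurement side of items 2 and 4, even a crude test is better than none. The sketch below times a large in-memory copy from Python. It is single-threaded and carries interpreter overhead, so treat the result as a rough floor rather than what a dedicated tool such as the STREAM benchmark would report. The function name and default sizes are my own choices.

```python
import time


def copy_bandwidth_gbs(size_mb=64, repeats=5):
    """Estimate memory copy bandwidth in GB/s from the fastest of several runs."""
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full read of src plus one full write of dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Count both the bytes read and the bytes written.
    return 2 * len(src) / best / 1e9


if __name__ == "__main__":
    print(f"measured copy bandwidth: {copy_bandwidth_gbs():.2f} GB/s")
```

Compare the number it prints against the theoretical figure from the block diagram. Keep the buffer much larger than the CPU caches, or you will end up measuring cache speed instead of memory speed.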

Till next time...
