Sunday, September 23, 2007

Block Level Diagrams and Bandwidth Analysis

If you truly want to learn about the potential of any given server system, I believe that it is critical to spend time analyzing a block level diagram of the board or chipset layout. Many times, these diagrams can give you a pretty good or sometimes even an exact figure for the theoretical bandwidth of a given system.

So, how can we gain benefit from this analysis?

It's pretty easy, actually.

First, start with CPU-to-memory bandwidth. A well thought out block diagram will show the theoretical bandwidth from CPU to memory. In addition, it's important to verify whether the system is laid out to support a single front side bus or multiple front side busses, or whether it uses a front side bus architecture at all.

Systems with AMD processors have integrated memory controllers in their CPUs, which allows each CPU to be directly attached to memory. Technologies such as node and bank interleaving, as well as NUMA (non-uniform memory access), can allow for extraordinary aggregate memory performance on systems with CPU-integrated memory controllers. The lack of an integrated memory controller is one of the key weaknesses of Intel x86-64 processors, but Intel claims that it will have this technology shipping in the next several months.

Once you figure out CPU-to-memory throughput, you then need to analyze the block diagram and see the level of bandwidth allowed from the main chipset chips to the embedded and add-in peripheral slots (i.e., PCI Express). For example, your system might have four x16 PCI Express slots (64 lanes), but the system might only have an aggregate of 32 lanes running from those four slots to the primary or secondary chipset controller. Therefore, your theoretical bandwidth out of these four slots is 32 lanes of PCI Express (16 GB/s) instead of the expected 64 (32 GB/s).
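To make the lane math concrete, here is a minimal C sketch comparing the advertised lane count against the lanes actually wired to the chipset. It assumes PCI Express 1.x at 250 MB/s per lane per direction, and the four-slot layout is just the hypothetical example above.

    /* Theoretical PCI Express 1.x bandwidth: advertised slots vs. lanes
       actually wired to the chipset. Assumes 250 MB/s per lane per direction. */
    #include <stdio.h>

    int main(void)
    {
        const double mb_per_lane_per_dir = 250.0;   /* PCI Express 1.x */
        int physical_lanes = 64;                    /* four x16 slots */
        int wired_lanes    = 32;                    /* lanes actually routed to the chipset */

        printf("Advertised: %.0f GB/s bi-directional\n",
               physical_lanes * mb_per_lane_per_dir * 2 / 1000.0);
        printf("Wired:      %.0f GB/s bi-directional\n",
               wired_lanes * mb_per_lane_per_dir * 2 / 1000.0);
        return 0;
    }

Run as written, this prints 32 GB/s advertised versus 16 GB/s actually wired, which is exactly the gap described above.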

Oddly enough, many systems are laid out in a very inefficient manner here. It is not uncommon to see one or two Gigabit Ethernet ports sitting on a single 32 bit PCI bus. Since two Gig-E ports can theoretically transfer data at 500 MB/s full duplex, this creates a major bottleneck on a 32 bit PCI bus, which theoretically transfers at approximately 132 MB/s. This is why I often hear people complain about the performance of embedded network controllers. Add to this the fact that many chipsets and boards attach extra peripherals to these busses, and you have a recipe for a low performance nightmare.
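As a rough illustration, assuming a standard 32-bit/33 MHz PCI bus and decimal megabytes, the arithmetic looks like this:

    /* Theoretical demand of two full-duplex Gigabit Ethernet ports
       versus the capacity of a shared 32-bit/33 MHz PCI bus. */
    #include <stdio.h>

    int main(void)
    {
        double gige_mb_each_way = 1000.0 / 8.0;         /* 1 Gb/s = 125 MB/s per direction */
        double nic_demand = 2 * gige_mb_each_way * 2;   /* two ports, both directions */
        double pci_supply = 32.0 / 8.0 * 33.0;          /* 4 bytes wide at 33 MHz, shared */

        printf("NIC demand: %.0f MB/s\n", nic_demand);      /* 500 MB/s */
        printf("PCI supply: %.0f MB/s\n", pci_supply);      /* ~132 MB/s */
        printf("Oversubscribed by %.1fx\n", nic_demand / pci_supply);
        return 0;
    }

The NICs could ask for nearly four times what the bus can deliver, and that is before any other peripherals sharing the same bus are counted.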

So a quick recap of how to utilize a block level diagram to gauge best case performance:

1) Figure out best case CPU to memory bandwidth. Realistically, you will not come close to these numbers using embedded or add-in peripherals, but it does let you know the best case theoretical performance.

2) As a side note to item #1, I always like to run CPU to memory bandwidth tests to see the realistic throughput of a system (see the sketch after this list). Many times, you'll be shocked to see that it's far below theoretical expectations. Once again, this number is your high water mark--don't expect external bandwidth at these rates.

3) Verify any bottlenecks between the embedded or add-in peripherals and the chipset interfaces. Many times, you'll find that board layout will keep you far below expected rates.

4) And finally, run benchmarks whenever possible. The block level diagram is only a guide to best case results. There are still a number of hidden or underlying problems or reasons why your system may not achieve expected results.
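Here is a crude, single-threaded C sketch of the kind of CPU-to-memory test mentioned in item #2. It is only a sanity check, not a substitute for a proper tool such as STREAM; the 256 MB buffer and pass count are arbitrary choices, and the number it prints will sit well below the block diagram's theoretical figure.

    /* Crude CPU-to-memory copy bandwidth check (single threaded).
       Buffer size and pass count are arbitrary; results are a rough
       sanity check, not a rigorous benchmark like STREAM. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_BYTES (256UL * 1024 * 1024)   /* 256 MB working set */
    #define PASSES    8

    int main(void)
    {
        char *src = malloc(BUF_BYTES);
        char *dst = malloc(BUF_BYTES);
        if (!src || !dst) { perror("malloc"); return 1; }

        memset(src, 1, BUF_BYTES);            /* touch pages before timing */
        memset(dst, 0, BUF_BYTES);

        clock_t start = clock();
        for (int i = 0; i < PASSES; i++)
            memcpy(dst, src, BUF_BYTES);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* each pass reads the source and writes the destination,
           so count twice the bytes copied per pass */
        double mb = 2.0 * PASSES * BUF_BYTES / (1024.0 * 1024.0);
        printf("~%.0f MB/s copy bandwidth (check byte: %d)\n",
               mb / secs, (int)dst[BUF_BYTES - 1]);

        free(src);
        free(dst);
        return 0;
    }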

Till next time...

Wednesday, September 19, 2007

Motivations for this Blog

Before we begin to explore this topic in detail, I feel the need to be clear on my motivations for writing this blog.

Simply put, I have always had a fascination with coaxing as much performance out of a system as possible. It's always fun to see how fast you can push a system and also watch a variety of bugs surface during high bandwidth testing.

During my years in enterprise testing, I discovered that a certain dynamic of system performance was routinely overlooked: the performance, or bandwidth, of embedded or slot-based peripherals such as storage and networking adapters.

Many times, these performance anomalies can be traced back to poor design or motherboard layout, but they can also be signs of deeper systemic problems. A true pet peeve of mine is a server that only allows for a fraction of its advertised available bandwidth. Unfortunately, this is often the case.

On a more selfish note, my company created a bandwidth intensive product called Iris. Iris is software that can convert any x86-64 system into a high speed Fibre Channel storage device. Based on the testing of hundreds of systems, it's pretty clear that our solution's primary bottleneck is the poor PCI (both X and Express) bandwidth allowed by so many of today's shipping systems.

So, the real motivation of this blog is twofold. One--it would be nice to educate the public on this shortcoming of enterprise systems, especially since our checkbooks are the only tangible means of voting that we have in technology. We can choose to purchase only systems that live up to their basic bullet items--performance being one of them.

And second, as is the case with Iris, having fast and reliable systems allows for faster and more innovative products.

In the words of Ricky Bobby, "If you ain't first, you're last!" So let's go looking for the fastest server shipping today. And let's expose the pretenders to the throne along the way.

Monday, September 17, 2007

Future Proof Servers, Virtualization, and the Need for Massive Server Bandwidth

Those who follow the server industry have likely noticed a trend in servers with four or more CPU sockets. For the past few years, these systems have been released with certain basic characteristics:

1) They usually have an extraordinary memory capacity. There are a number of systems already shipping with 256 GB memory capacity, with a possibility of 512 GB if/when 8 GB modules become available.

2) These systems generally have six or more x4 and greater PCI Express slots.

3) They are generally marketed for extreme enterprise usage, especially with regard to virtualization.

So the story is simply this: these servers are being designed for the ambiguous future of virtualization. They have lots of processing power, memory, and add-in PCI Express peripheral space for expandability.

For those of you just tuning in... Virtualization is simply a clever way of placing several operating systems on a single platform. Server hardware has historically been underutilized, so this is an attempt to squeeze greater potential out of a single box more efficiently. This, of course, leads to lower up front and recurring costs compared with purchasing several systems to fulfill the same tasks. That's my $.02 description for now.

The main point here is that virtualization is going to be much more demanding of server hardware. With multiple operating systems and applications on a single platform, the demands on CPU, memory, and external bandwidth all increase dramatically.

This is one of the primary reasons why we need external bandwidth testing and public results. What good is a massively virtualized server if the data moving in and out of the server is being transferred at a snail's pace? We'll begin to explore this potential in some near future posts.

Bandwidth Increasing Technologies

The computer industry is currently experiencing the ideal environment for high bandwidth server development. Here are a few technologies that are enabling the current bandwidth explosion:

PCI Express
PCI Express, which is a serial version of the previous parallel PCI bus, is standard in many modern systems. The slowest PCI Express slot (x1, or single lane) runs at 250 MB/s in transmit or receive, or bi-directionally at 500 MB/s, which is nearly four times the speed of the previous generation 32 bit PCI. The fastest generation-one PCI Express slot (x16) can transfer data bi-directionally at 8,000 MB/s, or over sixty times the speed of 32 bit PCI. This leap in I/O performance allows for bandwidth at extraordinary levels.
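If you want to check those multipliers yourself, here is a tiny C calculation, again assuming 250 MB/s per generation-one lane per direction and roughly 132 MB/s for 32 bit PCI:

    /* Verifies the speedup figures above: PCI Express 1.x at 250 MB/s per
       lane per direction, versus roughly 132 MB/s for 32 bit PCI. */
    #include <stdio.h>

    int main(void)
    {
        const double lane = 250.0;   /* MB/s per direction, PCIe 1.x */
        const double pci  = 132.0;   /* MB/s, 32-bit/33 MHz PCI */

        double x1_bidir  = 1  * lane * 2;    /*   500 MB/s */
        double x16_bidir = 16 * lane * 2;    /* 8,000 MB/s */

        printf("x1:  %5.0f MB/s (%.1fx 32 bit PCI)\n", x1_bidir,  x1_bidir  / pci);
        printf("x16: %5.0f MB/s (%.1fx 32 bit PCI)\n", x16_bidir, x16_bidir / pci);
        return 0;
    }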

64 Bit and Multiple Core Processors
Both AMD and Intel are offering desktop through server-grade CPUs that are capable of both 32- and 64-bit addressing. This enhancement allows for native addressing of memory above the 4 GB boundary, as well as a notable performance gain. Previously, this technology was limited to high end CPUs that cost a multiple of today's 64-bit offerings. The latest CPUs from both vendors also ship with dual and quad cores per processor, which allows for phenomenal multi-processing support.

New Memory Technologies
Currently a number of systems are (or soon will be) shipping with NUMA technology, while other systems allow for multiple front side busses. Both of these technologies allow for performance increases of two times or more over previous standards in system memory architectures. These technologies open up new performance levels of data bandwidth at very attractive prices. AMD currently has the memory bandwidth lead due to their integrated memory controller (called Direct Connect), but Intel has promised to have integrated memory controllers in their processors sometime next year.

I love competition in the server and CPU markets. Competition has been a tremendous catalyst for the growth of the above bandwidth friendly technologies.

Lesson One: Big B and Little b

One of the initial challenges that you will encounter when attempting to research bandwidth rates is the lack or ambiguity of standards when these metrics are discussed. Let's investigate a couple of problem areas here for reference:

Bits or Bytes: This is where we talk about big and little B. A bit is simply a one or a zero and is usually fairly meaningless by itself. A byte is eight bits and is capable of representing 256 unique values. A bit is written as lowercase b and a byte as uppercase B.

You've probably noticed that data speeds for networking gear are rated in megabits or gigabits, whereas some storage data (and capacity) rates are discussed in megabytes and gigabytes.

In this case, we have a simple conversion:

1 gigabit per second of bandwidth equals 1,000,000,000 bits per second, which is equal to 125 decimal megabytes per second. We achieve this result by simply dividing by 8.

Now why did I use the term "decimal" megabytes? Because networking bandwidth is generally described using power-of-ten (decimal) factors, whereas server bandwidth, and usually data storage capacity, is measured in powers of 2 (binary). Confusing? Of course it is, but it allows for a higher degree of creativity in the marketing of products. Actually, this is one of the areas where a standard would be nice. I've seen many bright engineers over the years get confused on this topic, sometimes comically and sometimes tragically.
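Here is a small C snippet that runs the conversion both ways, so you can see how the decimal and binary conventions diverge for the same one gigabit per second link:

    /* Converts 1 gigabit per second to megabytes per second using both
       decimal and binary megabytes. */
    #include <stdio.h>

    int main(void)
    {
        double bits_per_sec  = 1000.0 * 1000.0 * 1000.0;   /* 1 gigabit per second */
        double bytes_per_sec = bits_per_sec / 8.0;          /* divide by 8: bits -> bytes */

        printf("Decimal MB/s: %.1f\n", bytes_per_sec / (1000.0 * 1000.0));   /* 125.0 */
        printf("Binary  MB/s: %.1f\n", bytes_per_sec / (1024.0 * 1024.0));   /* 119.2 */
        return 0;
    }

Same link, two answers: 125 decimal megabytes per second or about 119 binary megabytes per second, which is exactly the kind of gap that marketing departments like to exploit.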

For simplicity, all server bandwidth results here will use big B (bytes) and power-of-2 conversions. And if you ever require clarification, please drop me a note.

The Need For Server Bandwidth Benchmarking

Ten years ago, the average x86 (Intel or AMD CPU based) server was only realistically capable of sending or receiving data at approximately 100 MB/s (megabytes per second) externally. Over the course of the past decade, many advancements have been made that allow for much higher levels of bandwidth. In fact, benchmarks run in early 2007 showed that at least one system was capable of 7,964 MB/s, which is nearly 80 times faster than just ten years ago. Looking at the roadmaps of both servers and server technologies, I expect that these results may more than double by the end of 2008.

In our society today, we have become ravenous for increased bandwidth. The entertainment industry is a prime example of this phenomenon as we're currently in the midst of an explosion of movie, music, and video downloads. Many believe that the Internet will become the vehicle for most entertainment downloads/purchases in the future.

As our bandwidth needs increase, I believe that it is worthwhile to explore the foundations of bandwidth, which lie in the server system itself. This blog will explore the external bandwidth limits of today's server systems. By external bandwidth, I simply mean the bandwidth that can be moved into or out of a server platform. In some cases, we might look at specific technologies, but more often we'll simply investigate the aggregate bandwidth potential of a variety of systems.

I intend to keep this site informational and talk about the dynamics of server bandwidth as much as possible. Also, I'd like to use it as a vehicle to publish various benchmarks as they pertain to high bandwidth server applications.

As with most things in life, this is not always a straightforward endeavor, but I will always attempt to explain my findings and reasoning whenever possible.