Competition is a beautiful dynamic of technological evolution. Anyone who's been paying attention to the x86 marketplace has seen some of the fiercest competition in recent memory between Intel and AMD. This is truly a beautiful thing, as we now have two companies working aggressively to develop better products and beat each other to the next major technological milestone.
As stated in my last column, the real winners here are the consumers. As both Intel and AMD work to make their products better, faster, and more cost effective, the end result is ultimately better products at better prices for the consumers.
One area where we will really appreciate this will be in the bandwidth capabilities of x86-64 systems coming out in late 2007 and through 2008. Although AMD has had a commanding lead in CPU to memory and PCI Express bandwidth, Intel is catching up with a vengeance.
Recently, I have seen several released and soon to be released Intel platforms from multiple vendors. I must say that the early bandwidth performance results have been extremely impressive. In fact, I have already seen some Intel systems running at twice the levels of bandwidth of what I had previously recorded.
What makes this news even more exciting is that these results were observed using shipping Intel processors--that is, processors with external memory controllers. It is reasonable (I'd call it a near certainty) to conclude that the Intel processors of 2008, with integrated memory controllers, will only further increase this bandwidth, hopefully by a healthy margin.
So, my crystal ball tells me that in 2008, we're going to see a number of bandwidth records broken by a large margin. And I'm happy to say that we as consumers will have many systems to choose from to meet even the most demanding of bandwidth needs.
Third I/O intends to demonstrate record setting bandwidth performance once again in 2008. We'll keep you up to date on our results as they are revealed to us.
Tuesday, November 13, 2007
Sunday, November 4, 2007
2007's Highest Bandwidth Server
We live in a world of constant change. This is especially true in technology. Today's post will discuss what I believe to be the highest external bandwidth results ever recorded from a commodity x86-64 server platform. However, with the rapid change of technology, it is safe to say that these results will quickly become dated by newer technologies.
Early in 2007, my company, Third I/O, was fortunate enough to participate in a collaborative benchmark with Emulex and AMD. The goal of our benchmark was to establish the highest realistic external throughput of Fibre Channel from a single server system.
After much research, we decided to run our experiments on an HP DL585G2 as it allowed for an extraordinary number of add-in PCI Express peripherals. In addition, the block level diagram for this board showed that it was capable of some pretty amazing performance.
Long story short, our experiments resulted in 7,964 MB/s of full duplex throughput. This means that data was simultaneously being transferred in and out of the system at this extraordinary data rate.
If you'd like further details, check out the performance brief at:
http://www.emulex.com/white/hba/tio-perfbrief.pdf
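For a sense of scale, here is a back-of-the-envelope sketch of what that number implies. The port count below is my own rough back-calculation, not a figure from the brief; it assumes 4 Gb/s Fibre Channel ports with roughly 400 MB/s of usable payload per direction after 8b/10b encoding.

# Rough back-calculation of the DL585G2 result. The per-port figure and
# the implied port count are assumptions for illustration; the Emulex
# performance brief above documents the actual configuration.

FULL_DUPLEX_MBPS = 7964                 # reported aggregate (in + out)
PER_DIRECTION = FULL_DUPLEX_MBPS / 2    # simultaneous traffic each way

FC_PORT_MBPS = 400                      # usable MB/s per 4 Gb/s FC port, per direction
ports_per_direction = PER_DIRECTION / FC_PORT_MBPS

print(f"~{PER_DIRECTION:.0f} MB/s flowing in each direction at once")
print(f"Roughly {ports_per_direction:.0f} saturated 4 Gb/s FC ports per direction")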
In closing, I feel it is only fair to state that this same level of performance would likely be seen on other similarly architected AMD systems. However, based on our research and system availability, the HP DL585G2 was the system we chose, and it truly did deliver some amazing results.
What lies ahead in 2008? I believe that this will be an amazing year for high bandwidth systems. Intel has many tricks up their sleeve, including integrated memory controllers in their new processors, but AMD is working on several new high bandwidth architectures as well. In the end, this is very good news for us as consumers.
Wednesday, October 10, 2007
A Cost Effective High Bandwidth System
One of the biggest catalysts for computer hardware development in recent years has been the video game industry. Video adapters and GPUs are the most bandwidth-hungry devices in computing today. In fact, the fastest peripheral slot in any shipping system is a Generation 1 x16 PCI Express slot, and the most common adapters that run x16 are video cards.
So, it should really be no surprise that some workstations are capable of extraordinary external system bandwidth. One of my favorite systems for benchmarking is based on the Tyan S2915 (aka the Thunder n6650W). If you're feeling brave, you could build one yourself, or you can have one of a number of reputable whitebox builders create one for you.
I like the S2915 for the following reasons:
1) With dual Opteron processors, this system takes full advantage of the integrated memory controllers and the two memory subsystems--one attached to each CPU via AMD's Direct Connect technology. You can either choose to use a NUMA strategy, or you can configure this system via POST setup to also use either node or bank interleaving. This in itself is pretty nice as I have observed some applications that appear to be much faster under a NUMA strategy, while others prefer the system interleaved options.
2) This system has an amazing 56 PCI Express lanes that run from the primary and secondary Nvidia chips to allow for extraordinary levels of peripheral bandwidth. This layout is even greater than a number of shipping enterprise server systems.
3) The PCI peripheral slots allow for lots of upgradeability. The system has two x16 and two x8 slots. In addition, there are two legacy PCI-X slots that can run at 133 MHz with a single adapter or 100 MHz if both slots are populated.
So, how fast is this system? Really fast in terms of bandwidth. The SiSoft Sandra memory benchmarks show that this system can transfer data at over 12,000 MB/s with 667 MHz DDR2 memory (this is CPU to memory bandwidth).
I was very fortunate to have two of these systems with me at a recent Storage Networking World Conference where I was showing off some pretty impressive Fibre Channel performance. Using four Emulex LPe11002 4 Gb/s Fibre Channel adapters in each system, I was able to transfer data between two S2915 systems at over 5.6 GB/s. This in itself shows the amazing performance of this system. However, these adapters were only running x4 PCI Express.
I can't wait for the next generation of these adapters, which will run 8 Gb/s Fibre Channel and x8 PCI Express. It is probably overly optimistic to believe that this system can deliver twice this level of performance with faster adapters. After all, that would be 11.2 GB/s of bandwidth, which is more than 93% of the benchmarked CPU to memory bandwidth of this system--and that would be a stretch. However, I'd love to see how fast this system can be pushed. I don't believe I've reached its limits yet.
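To put those numbers side by side, here is a quick sketch under a few assumptions of my own: four dual-port 4 Gb/s adapters per system (eight FC ports), roughly 400 MB/s of usable payload per port per direction, and the roughly 12,000 MB/s SiSoft Sandra memory figure quoted above.

# Ceilings for the S2915 Fibre Channel demo, counted full duplex (in + out).
# Treat this as a sketch under the assumptions stated above, not a model
# of the actual test topology.

FC_PORT_MBPS = 400                  # usable MB/s per 4 Gb/s port, per direction
PORTS = 4 * 2                       # four dual-port adapters (assumed)
MEASURED_MBPS = 5600                # result reported above
MEMORY_MBPS = 12000                 # SiSoft Sandra figure quoted above

fc_ceiling = PORTS * FC_PORT_MBPS * 2
print(f"FC wire ceiling: {fc_ceiling} MB/s full duplex")
print(f"Measured: {MEASURED_MBPS} MB/s ({MEASURED_MBPS / fc_ceiling:.0%} of ceiling)")

# Doubling everything (8 Gb/s FC, x8 PCIe) would push the target close to
# the measured memory bandwidth, which is why it seems like a stretch.
doubled = MEASURED_MBPS * 2
print(f"Doubled target: {doubled} MB/s "
      f"({doubled / MEMORY_MBPS:.0%} of measured memory bandwidth)")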
So, in summary, if you need a high bandwidth solution, but don't want to pay for an enterprise server, there are several workstation options out there for you to investigate. Based on my experience, the Tyan S2915 blows away the vast majority of shipping servers in bandwidth and it is a solid choice for bandwidth fanatics. However, it is not the only choice in this area, so please do your own research if you're in the market for this type of system.
I would never advise using a solution like this for server-type applications. Although quite fast, workstations are simply not held (or tested) to the same standards as servers. They do great for demonstrations, or for any application where a reboot isn't the end of the world.
Sunday, September 23, 2007
Block Level Diagrams and Bandwidth Analysis
If you truly want to learn about the potential of any given server system, I believe that it is critical to spend time analyzing a block level diagram of the board or chipset layout. Many times, these diagrams can give you a pretty good or sometimes even an exact figure for the theoretical bandwidth of a given system.
So, how can we benefit from this analysis?
It's pretty easy, actually.
First, start with CPU to memory bandwidth. A well thought out block diagram will show the theoretical bandwidth from CPU to memory. In addition, it's important to verify whether the system is laid out with a single front side bus, multiple front side busses, or no front side bus architecture at all.
Systems with AMD processors have memory controllers integrated into their CPUs. This allows each CPU to be directly attached to memory. Technologies such as node and bank interleaving, as well as NUMA (non-uniform memory access), allow for extraordinary aggregate memory performance on systems with CPU-integrated memory controllers. This is one of the key weaknesses of current Intel x86-64 processors, but Intel claims that it will have this technology shipping in the next several months.
Once you figure out CPU to memory throughput, you then need to analyze the block diagram to see how much bandwidth is available from the main chipset components to the embedded and add-in peripheral slots (e.g., PCI Express). For example, your system might have four x16 PCI Express slots (64 lanes), but only an aggregate of 32 lanes running from those four slots to the primary or secondary chipset controller. Therefore, your theoretical bandwidth out of these four slots is 32 lanes of PCI Express (16 GB/s) instead of the expected 64 (32 GB/s).
Oddly enough, many systems are laid out in a very inefficient manner here. It is not uncommon to see one or two Gigabit Ethernet ports sitting on a single 32-bit PCI bus. Since two Gig-E ports can theoretically transfer data at 500 MB/s full duplex, this creates a major bottleneck on a 32-bit PCI bus that theoretically tops out at approximately 132 MB/s. This is why I often hear people complain about the performance of embedded network controllers. Add the fact that many chipsets and boards attach extra peripherals to these buses, and you have a recipe for a performance nightmare.
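The two scenarios above are easy to put numbers on. Here is a small sketch using the hypothetical figures from the text: Gen1 PCI Express at roughly 250 MB/s per lane per direction, and classic 32-bit/33 MHz PCI topping out around 132 MB/s shared.

# Bottleneck checks for the two examples above. Gen1 PCI Express moves
# roughly 250 MB/s per lane in each direction; 32-bit/33 MHz PCI is a
# shared ~132 MB/s bus.

PCIE_LANE_MBPS = 250
PCI_32BIT_MBPS = 132

# Example 1: four x16 slots (64 lanes) funneled through 32 upstream lanes.
slot_bw = 4 * 16 * PCIE_LANE_MBPS * 2          # full duplex at the slots
upstream_bw = 32 * PCIE_LANE_MBPS * 2          # full duplex to the chipset
print(f"Slots advertise {slot_bw // 1000} GB/s, "
      f"but the chipset link caps them at {upstream_bw // 1000} GB/s")

# Example 2: two Gigabit Ethernet ports sharing one 32-bit PCI bus.
gige_demand = 2 * 125 * 2                      # 2 ports x 125 MB/s x 2 directions
print(f"Dual GigE can demand up to {gige_demand} MB/s, "
      f"while the PCI bus offers about {PCI_32BIT_MBPS} MB/s")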
So a quick recap of how to utilize a block level diagram to gauge best case performance:
1) Figure out best case CPU to memory bandwidth. You will not even come close to approaching these numbers realistically using embedded or add-in peripherals, but it does let you know best case theoretical performance.
2) As a side note to item #1, I always like to run CPU to memory bandwidth tests to see the realistic throughput of a system (a minimal sketch of such a test appears after this list). Many times, you'll be shocked to see that it's far below theoretical expectations. Once again, this number is your high water mark--don't expect external bandwidth at these rates.
3) Verify any bottlenecks between the embedded or add-in peripherals and the chipset interfaces. Many times, you'll find that board layout will keep you far below expected rates.
4) And finally, run benchmarks whenever possible. The block level diagram is only a guide to best case results. There are still a number of hidden or underlying problems or reasons why your system may not achieve expected results.
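As a companion to item 2 above, here is a very rough sketch of a CPU to memory bandwidth probe. It simply times large array copies with NumPy (assumed to be installed); dedicated tools such as SiSoft Sandra or STREAM are far more rigorous, but even a crude test like this will show you how far reality sits below the block diagram.

# Crude CPU-to-memory bandwidth probe: time repeated large array copies.
# This is a ballpark check only; use a dedicated benchmark for real numbers.
import time
import numpy as np

SIZE_MB = 512
ITERATIONS = 10

src = np.ones(SIZE_MB * 1024 * 1024 // 8, dtype=np.float64)
dst = np.empty_like(src)

start = time.perf_counter()
for _ in range(ITERATIONS):
    dst[:] = src                     # each pass reads SIZE_MB and writes SIZE_MB
elapsed = time.perf_counter() - start

moved_mb = 2 * SIZE_MB * ITERATIONS  # count both the read and the write traffic
print(f"~{moved_mb / elapsed:.0f} MB/s sustained copy bandwidth")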
Till next time...
Wednesday, September 19, 2007
Motivations for this Blog
Before we begin to explore this topic in detail, I feel the need to be clear on my motivations for writing this blog.
Simply put, I have always had a fascination with coaxing as much performance out of a system as possible. It's always fun to see how fast you can push a system and also watch a variety of bugs surface during high bandwidth testing.
During my years in enterprise testing, I discovered that a certain dynamic of system performance was routinely overlooked. And this dynamic was simply the performance, or bandwidth, of embedded or slot-based peripherals such as storage and networking adapters.
Many times, these performance anomalies can be traced back to poor design or motherboard layout, but they also can be signs of deeper systemic problems. A true pet peeve of mine is when a server only allows for a fraction of the advertised available bandwidth. Unfortunately, this is often the case.
On a more selfish note, my company created a bandwidth-intensive product called Iris. Iris is software that can convert any x86-64 system into a high speed Fibre Channel storage device. Based on the testing of hundreds of systems, it's pretty clear that our solution's primary bottleneck is the poor PCI-X and PCI Express bandwidth allowed by so many of today's shipping systems.
So, the real motivation of this blog is twofold. First, it would be nice to educate the public on this shortcoming of enterprise systems, especially since our checkbooks are the only tangible means of voting that we have in technology. We can choose to purchase only systems that live up to their basic bullet items--performance being one of them.
And second, as is the case with Iris, fast and reliable systems allow for faster and more innovative products.
In the words of Ricky Bobby, "If you ain't first, you're last!" So let's go looking for the fastest server shipping today. And let's expose the pretenders to the throne along the way.
Monday, September 17, 2007
Future Proof Servers, Virtualization, and the Need for Massive Server Bandwidth
Those who follow the server industry have likely noticed a trend in servers with four or more CPU sockets. For the past few years, these systems have been released with certain basic characteristics:
1) They usually have an extraordinary memory capacity. There are a number of systems already shipping with 256 GB memory capacity, with a possibility of 512 GB if/when 8 GB modules become available.
2) These systems generally have six or more PCI Express slots of x4 or greater width.
3) They are generally marketed for extreme enterprise usage, especially with regard to virtualization.
So the story is simply this: these servers are being designed for the ambiguous future of virtualization. They have lots of processing power, memory, and add-in PCI Express peripheral space for expandability.
For those of you just tuning in... Virtualization is simply a clever way of placing several operating systems on a single platform. Server hardware has historically been underutilized, so this is an attempt to squeeze greater potential out of a single box. This, of course, leads to lower up-front and recurring costs compared to purchasing several systems to fulfill the same tasks. That's my two-cent description for now.
The main item here is that virtualization is going to be much more demanding of server hardware. With multiple operating systems and applications on a single platform, the need for additional CPU, memory, and external bandwidth all increase dramatically.
This is one of the primary reasons why we need external bandwidth testing and public results. What good is a massively virtualized server if the data moving in and out of the server is being transferred at a snail's pace? We'll begin to explore this potential in some near future posts.
Bandwidth Increasing Technologies
The computer industry is currently experiencing the ideal environment for high bandwidth server development. Here are a few technologies that are enabling the current bandwidth explosion:
PCI Express
PCI Express, which is a serial version of the older parallel PCI bus, is standard in many modern systems. The slowest PCI Express slot (x1, or single lane) runs at 250 MB/s in transmit or receive, or 500 MB/s bi-directionally, which is nearly four times the speed of previous-generation 32-bit PCI. The fastest Generation 1 PCI Express slot (x16) can transfer data bi-directionally at 8,000 MB/s, or over sixty times the speed of 32-bit PCI. This quantum leap in I/O performance allows for extraordinary levels of bandwidth.
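For reference, those figures fall out of the Generation 1 signaling rate. Each lane runs at 2.5 GT/s with 8b/10b line coding, so 80% of the raw bits carry payload; a short sketch:

# Deriving the Gen1 PCI Express figures above from the signaling rate.
SIGNAL_RATE = 2.5e9            # transfers per second, per lane
ENCODING = 8 / 10              # 8b/10b line coding overhead

lane_mbps = SIGNAL_RATE * ENCODING / 8 / 1e6   # bits -> bytes -> MB/s

print(f"x1:  {lane_mbps:.0f} MB/s per direction, {2 * lane_mbps:.0f} MB/s bi-directional")
print(f"x16: {16 * lane_mbps:.0f} MB/s per direction, {32 * lane_mbps:.0f} MB/s bi-directional")
print("32-bit/33 MHz PCI, for comparison: ~132 MB/s shared")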
64 Bit and Multiple Core Processors
Both AMD and Intel offer desktop through server-grade CPUs that are capable of both 32- and 64-bit addressing. This enhancement allows for native addressing of memory above the 4 GB boundary, as well as a notable performance gain. Previously, this technology was limited to high end CPUs that cost a multiple of today's 64-bit offerings. The latest CPUs from both vendors also ship with dual and quad cores per processor, which allows for phenomenal multi-processing support.
New Memory Technologies
Currently, a number of systems are (or soon will be) shipping with NUMA technology, while others allow for multiple front side busses. Both approaches allow for performance increases of two times or more over previous system memory architectures, and they open up new levels of data bandwidth at very attractive prices. AMD currently has the memory bandwidth lead thanks to its integrated memory controllers and Direct Connect architecture, but Intel has promised to have integrated memory controllers in its processors sometime next year.
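If you are curious whether a given box actually exposes multiple memory nodes to the operating system, Linux makes this easy to check through sysfs. The sketch below is Linux-specific and assumes a NUMA-aware kernel; on a dual-socket Direct Connect system you would typically see node0 and node1, while a single front side bus system usually reports just node0.

# List the NUMA memory nodes the Linux kernel exposes under sysfs, along
# with each node's total memory. Linux-only; requires a NUMA-aware kernel.
import glob
import os

nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
print(f"{len(nodes)} memory node(s) reported by the kernel")

for node in nodes:
    meminfo = os.path.join(node, "meminfo")
    if os.path.exists(meminfo):
        with open(meminfo) as f:
            total = next(line for line in f if "MemTotal" in line)
        print(f"{os.path.basename(node)}: {total.split()[-2]} kB total")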
I love competition in the server and CPU markets. Competition has been a tremendous catalyst for the growth of the above bandwidth friendly technologies.