Sunday, October 6, 2024

Future Directions: Too Much of a Good Thing

Managing IT complexity is a major problem for most companies. This is especially true for the many small and mid-market companies struggling to manage all of the distributed systems they deployed over the past decade. If you couple this proliferation of systems with the increasing demands being placed on IT organizations because of the move to e-business, you’ll begin to see why so many customers are looking to IBM for help.

In response to those requests for help, IBM started several initiatives aimed at creating a manageable IT structure of the future. This series of articles has explored some of those IBM initiatives and looked at how they might benefit iSeries customers.

In this third and final article in this series, we look at how IBM is exploiting semiconductor technologies to create processors that do far more than just execute the instructions their predecessors did. This is the hardware base for what IBM believes will be the IT structure of the future.

More Hardware than We Need

In the early days of processor design, it was a goal to use as few transistors as possible because they were such valuable devices. The hardware cost of a particular computer was directly proportional to the number of transistors used. As a result, many designers were fixated on counting the transistors in their designs.

The early IBM System/3 processor design had about 3,000 transistors, and the designers were very proud of the fact that they could keep the hardware cost down by using so few devices. The software designers, too, were concerned about hardware costs in those days. For example, an early RPG compiler was designed to run in 4 KB of memory so we could offer a System/3 entry model with only 4 KB of expensive main memory.

As single-chip processors began to appear during the 1970s, there was an even stronger focus on minimizing the number of transistors used. No one wanted to create a design that overflowed onto another chip. The name microprocessor was synonymous with single-chip processor.

Even the move to Reduced Instruction Set Computers (RISC) was heavily motivated by the desire to use fewer transistors. A simpler design, it was argued, would require fewer transistors for the processor than would a Complex Instruction Set Computer (CISC) design. A RISC design would free up more transistors on the chip so they could be used for high-speed cache memories. The result would be higher processor performance than if the cache memories had to be on separate chips. Over time, more and more of these cache memories have been pushed onto the processor chips, until today some chips carry three full levels of cache memory.

By the early 1990s, it was becoming clear that designing for the minimum number of transistors might not be the best approach after all. Because of the phenomenon known as Moore’s Law, which says the number of transistors on a microchip doubles about every 18 to 24 months, hardware designers could see the day coming when they would have more transistors on a chip than they could effectively use.
Several university studies started to speculate about how all of the transistors might be used in the future. Specifically, many of these studies focused on designs for chips that contained one billion transistors.
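
To put that billion-transistor number in perspective, here is a small back-of-the-envelope sketch in C. The three-million-transistor starting point and the 18-month doubling period are illustrative assumptions for the example, not figures from any particular roadmap.

```c
#include <stdio.h>

/* Back-of-the-envelope sketch: how long does Moore's Law take to reach a
 * billion-transistor chip? The starting count (about 3 million transistors,
 * circa the early 1990s) and the 18-month doubling period are illustrative
 * assumptions, not actual roadmap figures. */
int main(void)
{
    double transistors = 3.0e6;   /* assumed starting count */
    double months = 0.0;

    while (transistors < 1.0e9) { /* target: one billion transistors */
        transistors *= 2.0;       /* one doubling...                 */
        months += 18.0;           /* ...roughly every 18 months      */
    }

    printf("Roughly %.0f doublings and %.1f years to pass one billion transistors\n",
           months / 18.0, months / 12.0);
    return 0;
}
```

Under those assumptions, about nine doublings, or roughly 13 to 14 years, separate an early-1990s chip from a billion-transistor design, which is why the studies began appearing when they did.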

The September 1997 issue of IEEE Computer magazine presented seven papers on billion-transistor architectures being studied at leading universities. Some of the proposals were fairly conventional, suggesting that we should continue doing what we already do, only do it better. A couple of proposals were fairly radical, requiring totally new programming models. Two of the proposals, however, were of interest to us at IBM Rochester because they described approaches similar to what we were developing for iSeries processors. I discuss both of these approaches in the following paragraphs.

Single-Chip Multiprocessors

One way to use more transistors is to implement more than one processor on the chip. Having multiprocessors as a fundamental part of even the smallest servers is becoming increasingly important. This is especially true as customers consolidate multiple servers into single larger servers as a way to reduce the complexity and cost of managing those multiple servers. The IBM Power4 chips currently used in the iSeries and the pSeries implement two full 64-bit PowerPC processor cores on each chip. Other vendors are now also beginning to follow suit for many of the same reasons.

This approach of adding more processor cores to a single chip can obviously be extended beyond just two processors, and several university studies have looked at how best to do this. Stanford University has been one of the leading proponents of single-chip multiprocessors and has proposed a billion-transistor processor architecture that implements four to 16 processor cores on a single chip. Their design is such that the processors may collaborate on a single, parallel job or run independent tasks.
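
As a rough illustration of what "collaborating on a single, parallel job" means to software, the following C sketch splits one summation across several worker threads, one per core. The thread count and data size are arbitrary choices for the example, not anything taken from the Stanford proposal.

```c
#include <pthread.h>
#include <stdio.h>

/* Minimal sketch: one job divided across several worker threads, the way a
 * single-chip multiprocessor lets its cores collaborate on a single
 * parallel task. NUM_WORKERS and N are illustrative values. */

#define NUM_WORKERS 4
#define N 1000000

static long data[N];
static long partial[NUM_WORKERS];

static void *sum_slice(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NUM_WORKERS);
    long hi = lo + (N / NUM_WORKERS);
    long sum = 0;

    for (long i = lo; i < hi; i++)   /* each worker sums its own slice */
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    long total = 0;

    for (long i = 0; i < N; i++)
        data[i] = i % 10;

    for (long id = 0; id < NUM_WORKERS; id++)
        pthread_create(&workers[id], NULL, sum_slice, (void *)id);
    for (long id = 0; id < NUM_WORKERS; id++)
        pthread_join(workers[id], NULL);

    for (long id = 0; id < NUM_WORKERS; id++)
        total += partial[id];        /* combine the per-worker results */

    printf("total = %ld\n", total);
    return 0;
}
```

The same program could just as easily run four unrelated tasks in its threads, which is the other way such a chip would be used.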

However, we probably will not see more than two processor cores on a single chip for the next few years because of a problem with heat dissipation. As we add more transistors on a chip, we also increase the amount of heat that the chip generates.

Too Hot to Handle

The number 174 million should be familiar to some iSeries customers. It is the number of transistors on a Power4 chip. That many transistors can take a lot of power when being pushed at high GHz speeds. To be a little more precise, a Power4 chip running at 1.1 GHz can consume about 115 watts. Those 115 watts of power generate a great deal of heat. Imagine a 115-watt light bulb that is the size of a Power4 chip, and you will get some idea of how hot that chip can be. For comparison purposes, an S-Star processor with its 44 million transistors running at 600 MHz consumes only 12 watts.

All of the heat generated by these chips has to somehow be drawn away from the chips to keep them from burning themselves up. Fortunately, IBM has over the years developed very sophisticated packaging technologies that can dissipate the heat generated by these chips. The multichip modules (MCM) used in the high-end models of the iSeries and pSeries to package four Power4 chips can easily dissipate all of the heat generated by the four chips. This MCM packaging, however, is quite expensive, which limits its use in smaller models.

One approach to reducing the power requirements for a chip is to shrink the size of the transistors, since smaller transistors require less power. The Power4 processors are built in 180-nanometer CMOS technology, where 180 nanometers refers to the smallest feature size on the chip, roughly the length of a transistor's gate. The second-generation Power4+ is built in 130-nanometer CMOS technology. Even though it has 184 million transistors, the Power4+ chip with its smaller transistors is physically smaller than the Power4 chip (267 square millimeters versus 366 square millimeters). Power4+ also consumes less power (about 70 watts at 1.2 GHz), which means it generates less heat and can therefore be used in smaller models.

Notice that if we shrink the size of the transistors, but at the same time increase the number of transistors on the chip, we have done little to solve the heat problem. For this reason, the next-generation Power5 chips will not have significantly more transistors than does Power4.

The Power4 chips were intended for high-end servers. Power5 will have a much broader range of applications, from blade servers to very high-end servers. As a result, the designers had to curtail the large power consumption and the resulting heat generated by Power4. A blade server, for example, is constrained by its packaging to about 25 to 40 watts.
To achieve this lower power consumption, the Power5 can scale back its power dissipation on parts of the chip, called “voltage islands,” based on its workload. This is similar to what Intel does with some of its Pentium processors. It is estimated that the Power5 will consume between 20 and 135 watts, depending on its workload.

Because Power5 will not have significantly more transistors than does Power4, there is no room for more than two processors on a chip. Power5 will, however, use another technique to achieve what are effectively four processors per chip. That technique is called multithreading.

Simultaneous Multithreading

The University of Washington has for the last few years been leading a proposal it calls simultaneous multithreading (SMT). The idea behind SMT is to share the processor hardware on a chip among multiple threads out of a multiprogrammed workload. A thread, here, is defined as an executable entity of work in the system and can be thought of as a separate instruction stream. SMT is a technique that lets multiple independent threads issue instructions to a processor’s functional units in a single cycle. This uniprocessor proposal departs from traditional uniprocessor architectures that issue instructions from only one thread at a time.

Power5 will feature an SMT design, which allows each processor on the chip to behave like two processors running at full speed. Power5 gets its SMT abilities not through new circuitry but through a different use of the existing execution units, the parts of the chip responsible for executing the various types of instructions. As a result, each Power5 chip will behave as if it had four separate processors. Depending on the application workload, Power5 will be able to increase performance up to four times that of Power4.
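
One practical consequence is that SMT is largely invisible to application code: each hardware thread simply appears to the operating system as another logical processor. The small C sketch below illustrates the idea on a system such as Linux or AIX that supports the _SC_NPROCESSORS_ONLN query; it is an illustration of the concept, not Power5-specific code.

```c
#include <stdio.h>
#include <unistd.h>

/* Minimal sketch (assumes a system, such as Linux or AIX, where sysconf
 * supports _SC_NPROCESSORS_ONLN): each SMT hardware thread shows up as one
 * more logical processor, so a two-core chip with two-way SMT would report
 * four processors here. */
int main(void)
{
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (cpus < 0)
        perror("sysconf");
    else
        printf("logical processors visible to software: %ld\n", cpus);
    return 0;
}
```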

Other chip designs from other vendors are also using a form of multithreading. For example, Intel’s “hyperthreading” version of this technology is estimated to increase performance by roughly 20 percent, depending on what program the chip is running.

Of particular interest to iSeries customers is the Star family of processors and their implementation of multithreading. The first use of multithreading in a commercial processor was in 1998 in the Northstar processor.

Each Northstar processor chip contained two complete sets of processor registers to support two separate threads of execution. Associated with each thread were an instruction stream and a set of register values. With two sets of registers, a single Northstar processor could alternate between two independent instruction streams.

While the Northstar design was not as aggressive as the Power5 design, our performance studies showed that the Northstar processor was able to increase processing throughput by 30 percent for a uniprocessor system and by a similar amount for multiprocessor systems by using the two sets of registers on each chip. Plus, it required less than a 10 percent increase in chip area.
It is important to note that multithreading is not new for IBM. Since 1998, iSeries and pSeries customers have enjoyed the performance benefits of multithreading in all members of the Star family, including the latest S-Star processors. Power5 promises to take multithreading to the next level.

Redundant Hardware for Availability

One aspect of IBM’s autonomic computing initiative is to make servers self-healing. Accomplishing this requires a combination of hardware and software. (For more information on the software support for self-healing, see the first article in this series, “Are You Ready for e-Business on Demand?” March 2003, article ID 16024.) Here, the focus is on the hardware support. A goal for future processor designs is to create smarter hardware that can detect errors and even anticipate failures.

One of the primary characteristics of IBM’s mainframes has been their extremely high availability. To achieve this level of availability, the mainframe hardware designers have used redundant hardware in their processor chips. This redundant hardware allows the processor to check itself, detect any errors that have occurred, and recover from those errors.

Power4 detects a great many errors and recovers from a significant number of them. The target for Power5 is to improve upon this by detecting, anticipating, and recovering from even more errors. To accomplish this, Power5 uses additional circuitry to detect and correct more kinds of errors. Sources involved with the Power5 design estimate that it will have approximately 95 to 97 percent of a mainframe chip’s ability to detect and recover from similar errors; by Power6, the two should be equal.

Move Software Functions to Hardware

One of the more intriguing and somewhat controversial uses for additional transistors on a chip is a hardware accelerator. The use of hardware accelerators is not new. Intel and others have for years added chip-based accelerators for various multimedia operations. The use of hardware accelerators for software tasks currently performed in operating systems and middleware is fairly new.

IBM is calling this form of software acceleration “Fast Path” and will begin to implement it on Power processors over the next few years. The idea behind Fast Path is simple. Find those tasks where the processors spend a great deal of their time executing a relatively small number of instructions, and hardwire those tasks into the silicon. For example, several communication tasks, including the TCP/IP processing used to read and write data on a network, are called frequently and tend to tie up a great many cycles in a processor. Such tasks can be offloaded to other special-purpose chips, of course, but a simpler solution is to implement them directly in the processor hardware.
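
To make the idea concrete, the C sketch below performs an ordinary TCP write and read through the standard sockets API. Code like this would not need to change with an accelerator such as Fast Path; the acceleration targets the protocol processing the operating system performs underneath these calls. The address and port are placeholders for the example.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Minimal sketch: a plain TCP write/read pair using the standard sockets
 * API. A hardware accelerator for TCP/IP would speed up the protocol
 * processing behind these calls without changing the application code.
 * 192.0.2.1 and port 7 are placeholder values. */
int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(7);                         /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &server.sin_addr);  /* placeholder host */

    if (connect(sock, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect");
        close(sock);
        return 1;
    }

    const char *msg = "hello";
    char buf[64];

    write(sock, msg, strlen(msg));                 /* send a few bytes */
    ssize_t n = read(sock, buf, sizeof(buf) - 1);  /* read the reply   */
    if (n > 0) {
        buf[n] = '\0';
        printf("received: %s\n", buf);
    }

    close(sock);
    return 0;
}
```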

In addition to communication tasks, Power processors in the future will accelerate other frequently used operating system tasks, for example those used to manage the virtual memory subsystem. Some middleware functions will also be accelerated. For example, accelerating the multiple table lookups in the DB2 database could greatly cut the time required to access a record.

Initially, OS/400 and AIX will be able to take advantage of the new hardware accelerators. To ensure that others, including Linux programmers, will also be able to use the new chip features, all of the interfaces to the silicon accelerators will be open.

The use of Fast Path in the Power processor designs has also sparked a debate in some quarters about whether these new processor designs are still RISC designs. Some argue that with these complex accelerator functions added, the result is no longer a RISC. Others contend that it is simply an extension of the original RISC definition. However you classify it, the modern version of a RISC processor isn’t what it used to be.

Originally published at http://www.iseriesnetwork.com.

Frank G. Soltis of IBM Rochester created the technology-independent architecture used in the AS/400 and iSeries. He is IBM’s iSeries chief scientist and a professor of computer engineering at the University of Minnesota.
