By: Dan Mepham
The next Pentium 4 processor, Prescott, arrives
In 1965, just a few years after the first integrated circuits saw the light of day, a chemist by the name of Dr. Gordon Moore made an observation that would become a guiding rule for the next forty years. His prediction, affectionately dubbed 'Moore's Law' by the press, is popularly quoted as stating that the number of transistors built into the latest integrated circuits would double every eighteen months. Three years later, in 1968, Moore would go on to co-found what is now the world’s largest semiconductor manufacturer, and would have a first-hand role in ensuring that his prediction would hold.
And hold it has. Intel’s first processors in the early 1970s consisted of just one or two thousand transistors. That increased to tens of thousands in the late seventies as Intel pushed its 8086 processor. Progress continued through the hundreds of thousands of transistors with the 80286 and 80386 families, and finally reached the million-transistor mark with the 486DX and its integrated FPU. The nineties were the decade of the Pentium processor, from its 3-million transistor introduction in 1993 to the 25-million transistor Pentium III in the late nineties. Finally, the current Pentium 4 processors boast a modest 55-million transistor count.
Key to increasing transistor count, and therefore performance, is reducing the size of those transistors. A chip built from 55 million transistors of 1970 vintage would never have worked; the circuit would have been too large and too hot to be practical. Shrinking the transistors allows them to be made more cheaply, switched faster, and run cooler. Over the decades we’ve seen transistors drop from several microns down to the current 0.13 micron technology.
Today we see the introduction of Intel’s smallest mass-produced transistor at just 0.09 microns (90 nanometers). Welcome, ladies and gentlemen, to Prescott Country.
Caching In
The use of the 90nm transistor allows Intel to construct much larger processors (in terms of the number of transistors) while keeping the physical size small. When processors are manufactured, yield rates are directly related to how large, physically, the processors are. A processor that is twice the size of another is essentially twice as likely to contain manufacturing impurities, and therefore will be subject to much lower yield rates. We’ve seen this in simple form with Intel’s server products. Later versions of the Pentium III Xeon, for example, incorporated huge on-die caches that bumped the transistor count into the hundreds of millions, and resulted in a die two to three times the size of a typical desktop processor at the time. These huge Xeons were difficult to manufacture, and came with a corresponding price premium.
The move to 90nm technology has allowed Intel to cram a comparatively huge amount of cache memory onto the Pentium 4 die. Prescott improves on the previous Northwood processor by boasting a huge 1MB L2 cache. Despite the larger cache, which helps to drive Prescott’s transistor count to over 125 million, the processor’s physical size remains manageable at only 112 square millimeters, roughly 50% smaller than Intel’s first Willamette Pentium 4 with its tiny 256kB L2 cache.
Intel has also taken the opportunity to increase the size of the Pentium 4’s L1 cache. The Pentium 4 was initially designed with a small 8kB L1 data cache as a tradeoff in order to maximize the speed of the cache. Prescott’s L1 data cache is now doubled to 16kB, while the L1 instruction cache (or Execution Trace Cache) remains at 12k micro-ops. Set-associativity of the L1 data cache has also increased from 4-way to 8-way.
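The link between die area and yield can be illustrated with the simple Poisson yield model, Y = exp(-A x D), where A is the die area and D is the defect density. This is only a minimal sketch, not Intel's yield data: the defect density below is a made-up value chosen for illustration, and the only figure taken from the article is Prescott's 112 square millimeter die.

```python
import math

def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under the Poisson model Y = exp(-A * D)."""
    area_cm2 = area_mm2 / 100.0  # 1 cm^2 = 100 mm^2
    return math.exp(-area_cm2 * defects_per_cm2)

# Hypothetical defect density, chosen only for illustration.
DEFECTS_PER_CM2 = 0.5

prescott_area = 112.0            # mm^2, from the article
double_area = 2 * prescott_area  # a die twice as large on the same process

small = poisson_yield(prescott_area, DEFECTS_PER_CM2)
large = poisson_yield(double_area, DEFECTS_PER_CM2)

print(f"{prescott_area:.0f} mm^2 die: {small:.1%} of dies defect-free")
print(f"{double_area:.0f} mm^2 die: {large:.1%} of dies defect-free")
```

Under this model, doubling the die area squares the yield fraction, so the larger die always yields worse, and the gap widens as the defect density rises.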

Figs. 1 & 2 - Color-enhanced photos of Intel's Pentium 4 processor dies. On the left is the 130nm Northwood core; the 90nm Prescott core is on the right. Notice the larger L2 area on the Prescott die.
As you'll see later in the benchmarks, however, there are tradeoffs necessary in order to implement such a large cache.
Branching Off
Intel has further made some subtle but important enhancements to the Pentium 4’s branch prediction systems. Mispredicted code branches result in pipeline stalls, as the entire pipeline needs to be flushed to clear the bad branch. With the Pentium 4’s extremely deep pipeline (more on this later), stalls have a dramatic impact on performance.
Despite the exemplary accuracy of the Pentium 4’s branch predictor units, there nevertheless exist situations in which the BPU simply cannot make a prediction. In this case, the Branch Target Buffer (BTB) contains no prediction information about the current branch, and so the processor defaults to a rather simple, static prediction algorithm. Intel has enhanced this simple static algorithm to be more accurate. Without excessive description, the new prediction algorithm examines the distance and other properties of the branch to attempt to ascertain whether the branch may be a loop-ending command, and thus whether or not it should be taken. Subtle enhancements have also been made to the dynamic branch prediction algorithms.
Branch prediction success rate is often difficult to quantify, and changes to branch prediction schemes can show various outcomes, ranging from much better performance, to marginally better performance, or even to decreased performance in some situations.
We have been given access to some in-house testing conducted by Intel, and while we cannot post actual numbers at this time, we can summarize the results as follows: testing using the SPECint_base2000 software showed that Prescott’s mispredicted branch rate ranged from 54% lower to 10% higher than Northwood’s at the extremes, and the overall average branch misprediction rate was about 12% lower on the new Prescott core than on Northwood, an impressive improvement.
Again, these results are difficult to quantify in terms of real-world performance, but the effects should not be underestimated given the degree to which mispredicted branches impact the performance of Prescott’s deep pipeline.
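A static fallback predictor of the kind described above can be sketched as follows. This illustrates the classic "backward taken, forward not taken" heuristic with an added distance check; it is not Intel's actual algorithm. The function name `static_predict` and the `MAX_LOOP_DISTANCE` threshold are hypothetical, chosen only to show how a branch's direction and distance could feed a guess when the BTB holds no history.

```python
# Hypothetical threshold: backward branches that jump farther than this are
# assumed not to be loop branches. The value is illustrative only.
MAX_LOOP_DISTANCE = 1024  # bytes

def static_predict(branch_addr: int, target_addr: int) -> bool:
    """Guess whether a branch with no BTB history will be taken.

    Returns True for "predict taken" and False for "predict not taken".
    """
    displacement = target_addr - branch_addr
    if displacement >= 0:
        # Forward branches typically skip over code (for example an if-body),
        # so guess that they fall through.
        return False
    # Backward branches usually close a loop, but a very distant backward
    # target is less likely to be a loop, so only nearby ones are guessed taken.
    return -displacement <= MAX_LOOP_DISTANCE

print(static_predict(0x1000, 0x0FE0))  # short backward branch
print(static_predict(0x1000, 0x1040))  # forward branch
print(static_predict(0x9000, 0x1000))  # distant backward branch
```

The three calls exercise the three cases: a short backward jump (treated as a loop and predicted taken), a forward jump, and a long backward jump (both predicted not taken).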
Round 3, SSE Gets a Refresh
Prescott marks the introduction of Intel’s latest extensions to the IA-32 ISA, adding thirteen new instructions. Most of these new instructions make use of the Streaming SIMD Extension (SSE) registers, and as a result, Intel has named the new instructions SSE3. The majority of these instructions relate to graphics and complex arithmetic operations. Two of the instructions were designed to help software make better use of the processor’s Hyper-Threading capability by helping to indicate when a thread may no longer be engaged in useful work.
Naturally the benefits of these added instructions will not become apparent until software developers begin to make use of them. As is generally the case with instruction set extensions, there will be particular pieces of software or particular operations that exhibit very tangible performance improvements, while others really have no use for the added instructions, and thus show no change.
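To give a flavor of the complex-arithmetic support, the sketch below emulates in plain Python what three of the SSE3 instructions (MOVSLDUP, MOVSHDUP and ADDSUBPS) do to four-element single-precision vectors, and combines them to multiply two pairs of complex numbers at once. It models only the lane-by-lane semantics, not performance. The helper names are ours, and `swap_pairs` stands in for an ordinary pre-SSE3 shuffle.

```python
def movsldup(v):
    """Duplicate the even-indexed lanes: [a0, a1, a2, a3] -> [a0, a0, a2, a2]."""
    return [v[0], v[0], v[2], v[2]]

def movshdup(v):
    """Duplicate the odd-indexed lanes: [a0, a1, a2, a3] -> [a1, a1, a3, a3]."""
    return [v[1], v[1], v[3], v[3]]

def addsubps(a, b):
    """Subtract in the even lanes and add in the odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]

def swap_pairs(v):
    """Swap neighbors within each pair (an ordinary shuffle, not new in SSE3)."""
    return [v[1], v[0], v[3], v[2]]

def mulps(a, b):
    """Lane-by-lane multiply."""
    return [x * y for x, y in zip(a, b)]

def complex_mul_packed(x, y):
    """Multiply two pairs of complex numbers stored as [re0, im0, re1, im1].

    For (a + bi)(c + di) the result is (ac - bd) + (ad + bc)i.
    """
    first = mulps(movsldup(x), y)               # [a*c, a*d, ...]
    second = mulps(movshdup(x), swap_pairs(y))  # [b*d, b*c, ...]
    return addsubps(first, second)              # [ac - bd, ad + bc, ...]

x = [1.0, 2.0, 3.0, -1.0]  # (1 + 2i) and (3 - i)
y = [3.0, 4.0, 0.5, 2.0]   # (3 + 4i) and (0.5 + 2i)
print(complex_mul_packed(x, y))  # [-5.0, 10.0, 3.5, 5.5]
```

The mixed subtract/add in a single step is what removes the extra sign-flipping work a complex multiply would otherwise need.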
Intel's 2004 Roadmap, Sock-et to Me!
Both Prescott and Northwood are introduced in 3.40 GHz versions today, and both are packaged in the current Socket-478 platform. 3.40 GHz will be the final stop for the Socket-478 platform at the high end, however.
When Intel introduces a 3.60 GHz variant of the Prescott processor in Q2 2004, it will be on the new Socket-775 platform only. Socket-775 boards will have much tougher power design specifications that will be necessary to feed these thirsty processors at 3.60 GHz and above. Subsequent versions of the Prescott processor, including the 3.80 GHz in Q3 2004, and the 4.00 GHz in Q4 2004, will appear on the Socket-775 platform only, as will Prescott’s successor, Tejas, in 2005. All Prescott Pentium 4 processors will operate with an 800 MHz FSB, and will feature Hyper-Threading Technology (excluding the 2.80A GHz model, which uses a 533 MHz bus and no Hyper-Threading).
At the low end, Intel will continue to use the Socket-478 platform for its Celeron processor through 2004. Over the year, the Celeron will slowly ramp up to 3.33 GHz using the 90nm process, and continue to use the Socket-478 platform. It will eventually migrate to the Socket-775 platform as well near the end of the year. All 90nm Celerons will get a bump to 256kB of L2 cache.

Fig. 3 - Intel's current 2004 roadmaps suggest the above processors will be introduced in the timeframes indicated. The last Socket-478 Pentium 4 processors are the 3.40 GHz parts introduced today.
As an aside, these 90nm Celerons may be of some interest to overclockers. A 2.53 GHz (533 MHz FSB) Celeron using the 90nm process will be introduced in Q2 2004, and depending on its price and the maturity of the process at that point, may prove to be a capable overclocker.
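As rough arithmetic on why such a part might appeal, the sketch below works out the clock multiplier implied by the quoted 2.53 GHz and 533 MHz FSB figures, then shows what that multiplier would give at faster bus speeds. It assumes the bus is quad-pumped (so a "533 MHz" FSB runs on a 133 MHz base clock) and that the multiplier is locked. The target bus speeds are hypothetical illustrations, not tested results.

```python
def core_clock_mhz(fsb_mhz: float, multiplier: int, pumping: int = 4) -> float:
    """Core clock = (effective FSB / pumping factor) * multiplier."""
    return fsb_mhz / pumping * multiplier

STOCK_FSB_MHZ = 533.33   # nominal quad-pumped bus speed
STOCK_CORE_MHZ = 2533.0  # the 2.53 GHz part mentioned above

# Multiplier implied by the stock figures.
multiplier = round(STOCK_CORE_MHZ / (STOCK_FSB_MHZ / 4))
print("implied multiplier:", multiplier)

# Hypothetical overclocked bus speeds, for illustration only.
for fsb in (533.33, 600.0, 667.0, 800.0):
    print(f"FSB {fsb:6.1f} MHz -> core {core_clock_mhz(fsb, multiplier):6.0f} MHz")
```

Whether any given chip, board, or memory would actually run at the higher bus speeds is a separate question that only testing can answer.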
Incremental Improvements
Beyond the previously discussed items, Prescott also contains several incremental improvements versus the previous Northwood core. We won’t discuss these in great detail, but rather summarize them briefly below:
Automated functional block design & strained silicon technology
Shifter/Rotator block added to one of the core’s double-speed ALUs
More flexible trace cache
Added a dedicated integer multiplier, which results in lower integer multiply latency.
Increased micro-op scheduler capacity
Improved hardware and software prefetching capability
Additionally, to clear up any confusion that may be caused by the marketing, the following table summarizes current Intel Pentium 4 processors that are available as of today.

Fig. 4 - Intel's current desktop processor lineup. These processors are available at retail and OEM levels as of the time of publication of this article.
Something Rotten in Santa Clara
Prescott appears to be a largely improved processor, and one that should easily outperform a Northwood-core Pentium 4 at an equivalent clock speed. Yet this is not the case. Further, there are some strong indications that there is something very seriously wrong with Intel’s 90nm process.
Firstly, Prescott was delayed. Earlier roadmaps showed Prescott arriving at the end of 2003, which clearly hasn't been the case. Secondly, Prescott’s pipeline has been deepened versus Northwood’s (probably related to the delays) from 20 stages up to a whopping 31 stages. More importantly, signs indicate that this wasn’t a previously planned change, and Intel seems much less inclined to discuss it than is typically the case when these types of changes are made. From a company that prides itself on adhering to its roadmaps religiously, and that typically talks about these changes openly, this is some rather alarming behavior.
Typically a process shrink like this would allow an almost instant boost in clock speed. The last drop, from the 180nm Willamette down to the 130nm Northwood, allowed an almost instant 20% boost in clock speed, which worked its way up to over 60% as the process was refined. The final Northwood at 3.40 GHz is clocked 70% higher than the fastest Willamette as a result of the success of the 130nm process.
This time, on the other hand, the drop to 90nm seems not to be resulting in the usual improvements. So much so, in fact, that a rather last-minute change to the pipeline was necessary to produce decent yields at the promised speeds. The longer pipeline will lower Prescott’s IPC, and largely offset any gains from the improvements discussed. See our benchmarks for a direct comparison. Some would no doubt argue that Intel is simply taking its time and preparing for the future, as there's no imminent danger from AMD at the moment (which also seems to be having trouble with its 130nm strained silicon process - coincidence?). There may be some validity to that argument.
Unfortunately at this point we can’t offer anything more than speculation. Intel’s public position is that everything is just fine, a 31-stage pipeline was all part of the plan, and it still promises 4GHz by year end. Yet its actions seem to indicate behind-the-scenes scrambling. Usually when there's this much whispering about problems, and such a tight-lipped reaction from the company, there's at least some truth to the speculation. We leave you to form your own conclusions.
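The tension between a deeper pipeline and better branch prediction can be put in rough numbers with a textbook model: effective CPI = base CPI + (branches per instruction) x (misprediction rate) x (flush penalty in cycles). The sketch below is back-of-the-envelope only. The pipeline depths (20 and 31 stages) and the roughly 12% lower misprediction rate come from this article; the base CPI, branch frequency, Northwood misprediction rate, and the assumption that a flush costs about one cycle per pipeline stage are illustrative guesses, not measurements.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, flush_penalty: float) -> float:
    """Cycles per instruction once branch misprediction stalls are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

# Illustrative guesses, not measured values.
BASE_CPI = 1.0               # CPI with perfect branch prediction
BRANCH_FRACTION = 0.20       # one instruction in five is a branch
NORTHWOOD_MISPREDICT = 0.05  # 5% of branches mispredicted

# From the article: 20 vs 31 stages, and about 12% fewer mispredictions.
northwood = effective_cpi(BASE_CPI, BRANCH_FRACTION, NORTHWOOD_MISPREDICT, 20)
prescott = effective_cpi(BASE_CPI, BRANCH_FRACTION,
                         NORTHWOOD_MISPREDICT * 0.88, 31)

print(f"Northwood-like CPI: {northwood:.4f}")
print(f"Prescott-like CPI:  {prescott:.4f}")
print(f"Per-clock slowdown: {prescott / northwood - 1:.1%}")
```

Under these assumptions the 12% drop in mispredictions does not make up for the longer flush, because 31 x 0.88 is still larger than 20. The model ignores the larger caches and every other Prescott change, so it says nothing about overall performance.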
Benchmark Configuration
Intel Pentium 4 Processor 3.20E GHz (Prescott)
Intel Pentium 4 Processor 3.06 GHz (Northwood)
Intel Desktop Board D875PBZ, 875 Chipset
512 MB (2 x 256 MB) PC3200 DDR Memory in Dual-Channel Configuration
ATI Radeon 9700 Pro
Western Digital WD400BB 40 GB Hard Disk
Creative Labs SoundBlaster Live!
Enermax EG465P-VE 460W Power Supply
Microsoft Windows XP Professional w/ Service Pack 1
Microsoft DirectX 9.0
Intel Chipset Drivers v/ 5.00.1012
Intel Application Accelerator v/ 3.5.0.2600
ATI Catalyst 4.1