How Innovations in DRAM Memory Architecture Promise to Raise Memory Throughput to 51.2GB/s

By Michael Ching
Director of Product Marketing
Rambus Inc.


Processor designers are working hard to satisfy demands from system developers and end-users, but increased processing throughput must be matched by improvements in memory bandwidth to deliver usable system-level performance gains.

Other constraints, such as the need to mitigate electromagnetic effects at multi-GHz switching frequencies, to minimise the effects of manufacturing tolerances such as board trace lengths, and to deliver continuous reductions in power consumption, also influence designers' choices and shape each successive generation of performance enhancements.

Shifting the Balance of Power

For over a decade, increasing processor speeds have spurred development of progressively faster DRAM technologies, to maintain the performance balance and fully utilise the increased processor performance. Double Data Rate (DDR) DRAM, for example, supports faster processor-memory transactions without increasing the Front Side Bus (FSB) clock rate, by transferring data on both the rising and falling clock edges. Transaction speeds are dramatically increased without incurring the EMI, power-consumption and thermal management challenges that usually accompany a higher clock rate. However, some trade-offs are seen in other aspects of the memory's behaviour. For example, "double-pumping" the memory bus has been accompanied by a doubling of the column prefetch buffer, which has had a corresponding effect on column access granularity.

The transition from DDR to DDR2 memory continues this trend. The DDR2 interface operates at twice the speed of the core, which increases the peak transfer rate to 6.4Gbit/s at 200MHz memory bus speed but also introduces a further doubling of the column prefetch buffer depth. With the advent of DDR3 memory, which again doubles the interface speed compared to DDR2, the column prefetch buffer is 8 bits deep, corresponding to a sustained access granularity of 128 Bytes for DDR3 modules with a 64-bit data width.

Although this progression has dramatically boosted the outright data transfer rate achievable, the accompanying trend towards higher access granularity restricts the performance of applications in many of today's fastest-growing and most exciting market sectors. These include high-resolution graphics, 1080p HDTV, network communications processing, and multi-core supercomputing, which are predicated on ultra-high-speed processing of small blocks of data that are often only a few bits in size. In addition, the temporal locality of these blocks of data tends to be low. For example, in a network-switching application each packet stream is mixed randomly with packets from other simultaneous transfers. This results in a requirement for temporary storage of small packets having no locality of reference.

Current high-bandwidth DRAM architectures will be unable to meet the future demands of these applications, given the high column and row granularity inherent in the memory interface. This high granularity results in inefficient utilisation of the memory bandwidth, since the majority of data retrieved will be discarded by the application.
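The efficiency loss described above can be illustrated with a small calculation. The following is a minimal sketch, not taken from any Rambus material: the function name and the request sizes are hypothetical, and it assumes that every access fetches a whole number of granularity-sized blocks and that only the requested bytes are useful.

```python
def bandwidth_efficiency(useful_bytes: int, granularity_bytes: int) -> float:
    """Fraction of fetched data that the application actually uses.

    Assumption (illustrative only): the memory always returns whole
    blocks of `granularity_bytes`, so a small request is rounded up.
    """
    if useful_bytes <= 0 or granularity_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Round the request up to a whole number of access-granularity blocks.
    blocks = -(-useful_bytes // granularity_bytes)  # integer ceiling division
    fetched_bytes = blocks * granularity_bytes
    return useful_bytes / fetched_bytes


# Hypothetical 8-byte request against a coarse and a fine access granularity.
coarse = bandwidth_efficiency(8, 64)
fine = bandwidth_efficiency(8, 8)
```

Under these assumptions an 8-byte request uses one-eighth of a 64-byte access, so most of the transferred data is discarded, whereas the same request against an 8-byte granularity wastes nothing.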

High-Speed, Fine-Granularity

Development of memory interface and core technologies must now focus on regaining this lost efficiency, to better serve emerging applications that require fine memory-access granularity.

The Extreme Data Rate (Rambus XDR™) memory architecture, which is based on differential and point-to-point signalling, has been developed to deliver a further increase in memory bandwidth as processing speeds continue to rise. The XDR™ DRAM interface, while increasing signalling rates, also eliminates the effects of manufacturing variations in PCB trace lengths, and supports scalability to large module capacities without suffering the performance losses usually associated with multi-drop bus topologies.

Memory Interface Innovations

XDR introduces Octal Data Rate (ODR) signalling, which allows data exchange on the rising and falling edges of a clock that is multiplied to four times the 400MHz system clock. Eight bits of data are therefore transferred per system clock cycle, which enables 3.2GHz data rates with a 400MHz clock and provides a scalable path to over 6.4GHz as bandwidth needs increase. In combination with the increased signalling rate, improvements to signal integrity and speed are achieved through the use of Differential Rambus Signalling Level (DRSL) technology. DRSL has a signal excursion from 1.0V to 1.2V, resulting in higher speed and lower power consumption without compromising data integrity.
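The ODR arithmetic can be checked in a few lines. This is only a sketch of the calculation described above; the function and parameter names are illustrative and do not come from any XDR specification.

```python
def odr_data_rate_gbps(system_clock_mhz: float = 400.0,
                       clock_multiplier: int = 4,
                       edges_per_cycle: int = 2) -> float:
    """Per-link data rate in Gbit/s for a multiplied, double-edged clock."""
    # Four-times clock multiplication with transfers on both edges
    # gives eight bits per system clock cycle.
    bits_per_system_clock = clock_multiplier * edges_per_cycle
    return system_clock_mhz * bits_per_system_clock / 1000.0


rate = odr_data_rate_gbps()  # 400MHz system clock, ODR signalling
```

With the default values this reproduces the 3.2GHz data rate quoted for a 400MHz system clock.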

At the XDR interface, DRSL is applied in combination with Rambus FlexPhase™ Timing Adjustment technology. FlexPhase compensates for incremental effects such as small variations in PCB trace lengths due to manufacturing tolerances, producing controllable and deterministic signal timing that allows systems to operate close to ideal timing parameters rather than worst-case. In addition, Rambus Dynamic Point-to-Point (DPP) technology allows XDR modules to combine the easy scalability of a multipoint topology with the high speed of point-to-point signalling.

By combining these technologies, the XDR interface allows DRAMs featuring the standard core architecture to support signalling from 3.2GHz to 6.4GHz for data bandwidth from 6.4GB/s to 12.8GB/s from a single x16 XDR DRAM component. Further optimisation of the core allows access granularity to be reduced, thereby maximising the benefit of XDR's higher interface speeds in future generations of the XDR product family.
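The component bandwidth figures follow directly from the per-link data rate and the device width. A minimal sketch of that relationship, with a hypothetical function name, is:

```python
def device_bandwidth_gbytes(per_link_gbps: float, data_links: int) -> float:
    """Peak device bandwidth in GB/s: per-link rate times width, in bytes."""
    return per_link_gbps * data_links / 8.0


# x16 component at the two signalling rates quoted in the text.
low = device_bandwidth_gbytes(3.2, 16)
high = device_bandwidth_gbytes(6.4, 16)
```

These evaluate to 6.4GB/s and 12.8GB/s respectively. The same arithmetic applies to wider components or faster links; it gives peak figures only and says nothing about how efficiently an application uses them.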

Focus on Core Issues

Let us now discuss the changes that are required in the DRAM core to reduce access granularity. Consider a standard DRAM core organised as eight banks that are logically interleaved, as shown in figure 1. Two sets of data pins divide the banks into two halves that operate in parallel in response to row and column commands. A row command selects a single row within each bank half, and two column commands select two column locations within each row half. Four bank halves make up a quadrant having its own set of column and row decoder circuits.

Figure 1. Standard DRAM core.

After a row command is received, the selected row is sensed and latched. The row-access time, tRR, must elapse before another bank can perform a row access. The bank's row circuitry is occupied throughout this interval. After a column command ("col x") is received, the selected column is accessed. The column-access time, tCC, must elapse before the bank can perform another column access ("col y").

The physical limitations on signal propagation times restrict the bit-transport interval to 0.25ns and constrain the minimum tCC to typically 4ns. Hence the maximum column access rate is 250MHz, and 16 bits are transported on each link during a column access. With 16 data links, the column granularity is 32 bytes. Because tRR is twice tCC, the row granularity is 64 bytes.
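Written out step by step, the figures above follow from the two timing parameters. In this worked form, t_bit is simply a label introduced here for the 0.25ns bit-transport interval; it is not a symbol used elsewhere in the article.

```latex
\begin{align*}
\text{column access rate} &= \frac{1}{t_{CC}} = \frac{1}{4\,\text{ns}} = 250\,\text{MHz} \\
\text{bits per link per column access} &= \frac{t_{CC}}{t_{\text{bit}}} = \frac{4\,\text{ns}}{0.25\,\text{ns}} = 16 \\
\text{column granularity} &= \frac{16\ \text{bits} \times 16\ \text{links}}{8\ \text{bits/byte}} = 32\ \text{bytes} \\
\text{row granularity} &= \frac{t_{RR}}{t_{CC}} \times 32\ \text{bytes} = 2 \times 32\ \text{bytes} = 64\ \text{bytes}
\end{align*}
```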

Micro-Threading for Bandwidth Efficiency

Reorganising the core into a larger number of banks, each with independent row and column circuitry, provides the opportunity to overcome the restrictions on tRR and tCC. This architecture can be implemented in most modern DRAM cores with minimal area overhead, and allows several small accesses to occur during these time intervals. The enhanced core is said to be micro-threaded.

Figure 2. Micro-threaded DRAM core.

Figure 2 shows the internal details for a micro-threaded DRAM core. There are 16 independent banks, each equivalent to a half-bank of the typical DRAM core shown in figure 1. Even-numbered banks connect to the "A" data pins and the odd-numbered banks connect to "B" data pins. The banks are organised as groups of four, forming quadrants that have dedicated row and column circuitry and are therefore able to operate independently in response to row and column commands. A column access of an upper quadrant is interleaved with the corresponding column access of the lower quadrant.

Figure 3 shows the timing of a transaction for this micro-threaded DRAM component. After a row command ("r0") is received, the selected row (in bank 0) is accessed. A time tRR must elapse before another bank in the same bank quadrant can perform a row access. However, banks in the other three quadrants may be accessed during the interval – row commands r1, r2, and r3 are directed to banks 1, 2, and 3, respectively.

Figure 3. Data transaction timing in micro-threaded DRAM core.

After a column command ("c0x") is received, the selected column is accessed (column 0x of row 0 of bank 0). A time tCC must elapse before this bank can receive another column access command ("c0y"). However, banks in the other three quadrants may be column-accessed during the interval – column commands c1x, c2x, and c3x are directed to banks 1, 2, and 3, respectively.
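The interleaving of commands across quadrants can be modelled with a very small scheduler. This is a simplified sketch rather than a description of the actual XDR command protocol. It assumes that commands issue strictly in order, that bank n belongs to quadrant n mod 4 (consistent with banks 0 to 3 lying in different quadrants, but otherwise a guess), and that the command bus accepts one command per slot. The slot lengths used are hypothetical. The tRR value of 8ns follows from tRR being twice a 4ns tCC.

```python
def schedule_commands(banks, busy_ns, slot_ns, quadrants=4):
    """Return the earliest issue time (ns) of each command, in order.

    busy_ns : interval a quadrant is occupied after a command
              (tRR for row commands, tCC for column commands).
    slot_ns : assumed spacing between successive commands on the bus.
    """
    quadrant_ready = [0] * quadrants  # when each quadrant can accept a command
    next_slot = 0                     # when the command bus is next free
    issue_times = []
    for bank in banks:
        quadrant = bank % quadrants   # assumed bank-to-quadrant mapping
        t = max(next_slot, quadrant_ready[quadrant])
        issue_times.append(t)
        quadrant_ready[quadrant] = t + busy_ns
        next_slot = t + slot_ns
    return issue_times


# Row commands r0..r3 to banks 0..3, then a second row access in bank 0's quadrant.
row_times = schedule_commands([0, 1, 2, 3, 0], busy_ns=8, slot_ns=2)

# Four row commands that all target the same quadrant must wait tRR each time.
same_quadrant = schedule_commands([0, 4, 8, 12], busy_ns=8, slot_ns=2)
```

In this model the four commands to different quadrants all issue within a single tRR interval, while four commands to the same quadrant are spaced a full tRR apart. Passing busy_ns=4 models column commands and tCC in the same way.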

As with the typical DRAM core example, the tCC interval is 4ns, and the bit transport interval is 0.25ns. However, each column access only transports data for half the tCC interval, and each column access only uses 8 of the 16 data links, resulting in a column granularity of 8 bytes, one-quarter of the previous value. The row granularity is 16 bytes, again one-quarter of the previous value.
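The two reductions, half the transport time and half the data links, can be put side by side with the standard core. The sketch below uses illustrative names and integer picosecond values to avoid rounding. It restates the arithmetic in the text and does not model the device itself.

```python
def column_granularity_bytes(t_cc_ps: int, bit_time_ps: int,
                             links_used: int, transport_divisor: int = 1) -> int:
    """Bytes moved by one column access.

    transport_divisor = 1: data is transported for the whole tCC interval.
    transport_divisor = 2: data is transported for half the tCC interval.
    """
    bits_per_link = (t_cc_ps // bit_time_ps) // transport_divisor
    return bits_per_link * links_used // 8


def row_granularity_bytes(column_bytes: int, t_rr_over_t_cc: int = 2) -> int:
    """Bytes moved per row access, given that tRR is a multiple of tCC."""
    return column_bytes * t_rr_over_t_cc


# Standard core: full tCC interval, all 16 data links.
standard_col = column_granularity_bytes(4000, 250, links_used=16)
# Micro-threaded core: half the tCC interval, 8 of the 16 data links.
micro_col = column_granularity_bytes(4000, 250, links_used=8, transport_divisor=2)

standard_row = row_granularity_bytes(standard_col)
micro_row = row_granularity_bytes(micro_col)
```

The results are 32 and 64 bytes for the standard core against 8 and 16 bytes for the micro-threaded core, a factor of four in both cases.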

Reducing granularity in this way delivers performance advantages for applications in the groups mentioned previously, even though interface transfer bandwidth and core access intervals are unchanged compared to a standard non-micro-threaded component. Figure 4 highlights the performance benefit of micro-threading, comparing two DRAMs featuring identical core and interface speeds operating in a graphics application accessing a range of triangle sizes. The micro-threaded core has two to four times the effective triangle access rate.

Figure 4. Comparison of micro-threaded and non-micro-threaded DRAM performance.

By adding this and other innovative features, future generations of the XDR memory architecture will be capable of supporting data rates from 6.4GHz to 12.8GHz, thereby dramatically increasing the bandwidth to between 25.6GB/s and 51.2GB/s from a single x32 future-generation XDR DRAM component.

Conclusion: No Limits

The XDR memory architecture continues to provide unprecedented levels of memory performance to keep up with processor performance requirements in next-generation gaming, compute, and consumer platforms. Rambus innovations such as micro-threading effectively regain the memory bandwidth efficiency lost through successive generations of high-speed interfaces that have traded access granularity for improvements in maximum data rate.

Continuing this trend, future demand for increased memory bandwidth will require further architectural innovations. Rambus is well placed to meet these requirements.

About the Author

Michael Ching has over 14 years of experience in high-speed design. He joined Rambus Inc. in 1996, and currently manages marketing of Rambus' high-speed interface products and intellectual property portfolio. At Rambus, he has held various positions in industry-infrastructure enabling and design engineering. Prior to joining Rambus, Michael designed high-speed I/Os for microprocessors for Intel Corporation.

Michael holds an M.S. in electrical engineering from the University of California at Berkeley.


Copyright © 2008 ChipEstimate.com All rights reserved.