Xcell Journal Online
Better... Stronger... Faster



by Hitesh Patel, Sr. Manager, Software Product Marketing, Xilinx, Inc.
hitesh.patel@xilinx.com (3/15/04)


Virtex-II Pro FPGAs offer marked performance
advantages over a competing device.


As programmable logic devices increase in density and complexity, the combination of a feature-rich fabric and sophisticated design tools enables users to realize their performance goals faster. Shorter design cycle times also enable users to lower overall design costs and meet time-to-market requirements.

From an analysis of 50 customer designs, we determined that Xilinx Virtex-II Pro™ FPGAs enjoy an average 40% performance advantage over their nearest competitor, Altera™ Stratix™ FPGAs. Across designs with densities ranging from 200,000 to 6 million system gates, the Virtex-II Pro device was as much as 123% faster than the Stratix device. Figure 1 shows the performance advantage distribution.

This article highlights how Virtex-II Pro FPGAs, along with ISE 6 design tools, provide a 40% performance advantage when compared to Stratix FPGAs.

Architectural Features
The basic building block in the Stratix architecture is called a logic element (LE). An LE contains three functional structures: a four-input look-up table (LUT), a register, and a carry chain.

Virtex-II Pro architecture not only includes the structures found in an LE, but also additional functionality, such as a function expander (MUXF), a MULT_AND arithmetic cell, and a more logic-rich carry structure.

Furthermore, the Virtex LUT can be used as a 16-bit shift register or as a single- or dual-port RAM element. These additional features in the Virtex-II Pro architecture enable users to realize higher design performance, as we’ll describe in the next section.

MUXF Function Expander
One of the primary factors impacting circuit performance in FPGAs is the number of logic levels in the signal path. The function expander cell represents a 2:1 MUX, which can be used to build functions wider than four inputs without the need for additional LUT logic levels.

For example, using the MUXF, only four LUTs are required to implement an 8:1 mux in a single LUT logic level. That same 8:1 mux in the Stratix PLD is implemented using five LUTs – and the implementation is two LUT logic levels. The additional LUT logic level adds delay to the signal path.
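
The 8:1 mux mapping can be sketched as a small behavioral model. This is an illustration only, not vendor code: the helper names (lut4, muxf, mux8_with_expanders) are hypothetical, and the model assumes each LUT acts as a 2:1 mux on a pair of data bits while two stages of dedicated expander muxes combine the four LUT outputs.

```python
from itertools import product

def lut4(truth_table, a, b, c, d):
    """Evaluate a 4-input LUT; truth_table is a 16-bit int indexed by the inputs."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (truth_table >> index) & 1

def muxf(in0, in1, sel):
    """Dedicated 2:1 function-expander mux (modeled as not consuming a LUT)."""
    return in1 if sel else in0

def make_mux2_table():
    """Build the truth table for a LUT used as a 2:1 mux (fourth input unused)."""
    table = 0
    for idx in range(16):
        d0, d1, sel = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
        if (d1 if sel else d0):
            table |= 1 << idx
    return table

MUX2 = make_mux2_table()

def mux8_with_expanders(data, sel):
    """8:1 mux from four LUTs (one LUT logic level) plus expander muxes."""
    s0, s1, s2 = sel & 1, (sel >> 1) & 1, (sel >> 2) & 1
    # Single LUT level: each LUT picks one bit from a pair of data inputs.
    luts = [lut4(MUX2, data[2 * i], data[2 * i + 1], s0, 0) for i in range(4)]
    # Expander muxes combine the LUT outputs without adding a LUT level.
    low = muxf(luts[0], luts[1], s1)
    high = muxf(luts[2], luts[3], s1)
    return muxf(low, high, s2)

def check_all():
    """Compare against direct indexing for every data pattern and select value."""
    return all(
        mux8_with_expanders(bits, sel) == bits[sel]
        for bits in product((0, 1), repeat=8)
        for sel in range(8)
    )
```

Calling check_all() exercises all 2,048 data/select combinations. The model counts four LUT evaluations and no LUT feeding another LUT, which is the property the text describes.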

The function expander is not limited to multiplexers; it can be used for many other logic functions. For example, a MUXF combined with two LUTs can implement any function of five inputs, thereby implementing a full five-input LUT in a single LUT logic level. A Stratix implementation would require two or three LUTs, depending on the function, and would be implemented in two LUT logic levels.
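
One way to see why two LUTs plus a function expander cover any five-input function is a Shannon expansion on the fifth input: one LUT holds the function with that input at 0, the other holds it with that input at 1, and the expander selects between them. The sketch below is a minimal model under that assumption; the function names are hypothetical.

```python
def lut(table, index):
    """Read one bit from a truth table stored as an integer."""
    return (table >> index) & 1

def split_five_input(table32):
    """Split a 32-entry truth table into two 16-entry LUT tables.

    The low half covers the fifth input = 0, the high half covers it = 1.
    """
    return table32 & 0xFFFF, (table32 >> 16) & 0xFFFF

def eval_five_input(table32, index5):
    """Evaluate a five-input function using two LUT4s and one 2:1 expander."""
    lut_e0, lut_e1 = split_five_input(table32)
    low4 = index5 & 0xF          # the four inputs shared by both LUTs
    e = (index5 >> 4) & 1        # the fifth input drives the expander select
    out0 = lut(lut_e0, low4)
    out1 = lut(lut_e1, low4)
    return out1 if e else out0   # function expander, not a third LUT

def matches_reference(table32):
    """Check the two-LUT mapping against the original 32-entry table."""
    return all(eval_five_input(table32, i) == lut(table32, i) for i in range(32))
```

For example, matches_reference(0x96696996) checks five-input parity, and the same check passes for any other 32-bit truth table because the split is exhaustive.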

Figure 2 shows a nine-input function mapped onto two LUTs (plus one function expander for the Virtex-II Pro architecture). The same function requires three LUTs for the Stratix device and two LUT logic levels, as opposed to a single LUT logic level for a Virtex-II Pro device.
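
As a rough illustration of the kind of nine-input function involved, assume it has the form "select between a function of four inputs and a function of four other inputs." The specific functions below (a 4-input AND and a 4-input XOR) are chosen only for the example and are not taken from Figure 2; the helper names are hypothetical.

```python
from itertools import product

AND4 = 0x8000   # truth table: output is 1 only when all four inputs are 1
XOR4 = 0x6996   # truth table: output is the parity of the four inputs

def lut4(table, a, b, c, d):
    """Evaluate a 4-input LUT whose truth table is a 16-bit integer."""
    return (table >> (a | (b << 1) | (c << 2) | (d << 3))) & 1

def nine_input(x):
    """x[0:4] feed one LUT, x[4:8] feed the other, x[8] drives the expander."""
    lo = lut4(AND4, *x[0:4])
    hi = lut4(XOR4, *x[4:8])
    return hi if x[8] else lo    # function expander selects, no extra LUT level

def reference(x):
    """The same function written directly, for comparison."""
    if x[8]:
        return x[4] ^ x[5] ^ x[6] ^ x[7]
    return x[0] & x[1] & x[2] & x[3]

def check_all():
    return all(nine_input(x) == reference(x) for x in product((0, 1), repeat=9))
```

Without the expander, the final select would itself occupy a third LUT fed by the first two, which is where the second LUT logic level comes from.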

The MUXFx component is like having a five- or six-input LUT. This leads to fewer logic levels and also far fewer LUTs consumed (10% on average) than for the same function in Stratix FPGAs. This results in higher performance for Virtex-II Pro designs because fewer logic levels are generally required for critical paths. At the same time, less placement and routing congestion occurs because 10% fewer resources (LUTs) are necessary to build the same functionality.

Shift Register LUT
A LUT in shift register mode (SRL) can implement a selectable 16-bit shift register in a single LUT. The same shift register in a Stratix device would be implemented either with 16 flip-flops and as many as 10 LUTs or with a memory block, both of which are much less flexible.

In a Stratix PLD, if the shift register cannot be implemented in a memory block, a 16-bit shift register implemented using 16 LEs creates added routing congestion that may impact design performance. If the shift register requires variable tap selection, this will add logic levels on the output path, resulting in much slower operation.
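
The behavior of a selectable 16-bit shift register can be sketched as follows. This is a behavioral model only, with a hypothetical class name, and it assumes a 4-bit address that picks which of the 16 stored bits appears at the output (address 0 being the most recently shifted-in bit).

```python
class ShiftRegisterLUT:
    """Behavioral sketch of a 16-bit shift register with a selectable tap."""

    DEPTH = 16

    def __init__(self):
        self.bits = [0] * self.DEPTH

    def clock(self, data_in):
        """Shift one bit in; the oldest bit falls off the end."""
        self.bits = [data_in & 1] + self.bits[:-1]

    def read(self, addr):
        """Return the bit at the selected tap (0 = newest, 15 = oldest)."""
        return self.bits[addr & 0xF]


# Usage: shift in a pattern, then change the tap without changing the hardware.
srl = ShiftRegisterLUT()
pattern = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
for bit in pattern:
    srl.clock(bit)
newest = srl.read(0)     # last bit shifted in
oldest = srl.read(15)    # bit shifted in 15 clocks before the last one
```

In a flip-flop implementation, the equivalent of read(addr) is a 16:1 mux on the register outputs, which is the extra output-path logic the paragraph above refers to.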

MULT_AND
The MULT_AND arithmetic cell is commonly used in soft multiplication applications. However, the flexibility of the FPGA fabric allows some five-input functions to be mapped onto a single LUT. For example, loadable up and down counters implemented using the MULT_AND function utilize only one LUT per bit instead of two LUTs per bit, as in Stratix PLDs. This implementation can result in as much as 30% faster performance in Virtex-II Pro FPGAs because of the fewer logic levels and fewer required LUTs.

LUT-based RAMs
A LUT may also be configured as a single- or dual-port RAM, resulting in very fast read and write access for smaller data storing and buffering applications. In Stratix devices, the smallest RAM configuration (the M512 blocks) offers much slower RAM operation and less flexible dual-port access, while at the same time requiring greater latency for reads.

The maximum read speeds for the M512 RAMs are 266 MHz for one-clock cycle reads and 320 MHz for two-clock cycle latency, while the Virtex-II Pro SelectRAM™ memory allows 360 MHz read operation with a single clock latency, as well as asynchronous read capability for low-latency design requirements.
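
To put these frequencies in terms of time, the short calculation below converts each figure into a read latency in nanoseconds. It assumes latency is simply the number of clock cycles multiplied by the clock period at the quoted maximum frequency, which ignores clock-to-out and setup margins; the function name is hypothetical.

```python
def read_latency_ns(clock_mhz, cycles):
    """Latency in ns, assuming cycles x clock period at the given frequency."""
    return cycles * 1000.0 / clock_mhz

# Figures quoted in the text above.
m512_one_cycle = read_latency_ns(266, 1)    # M512, one-clock-cycle read
m512_two_cycle = read_latency_ns(320, 2)    # M512, two-clock-cycle read
selectram = read_latency_ns(360, 1)         # SelectRAM, single clock latency

for name, value in [
    ("M512, 1 cycle @ 266 MHz", m512_one_cycle),
    ("M512, 2 cycles @ 320 MHz", m512_two_cycle),
    ("SelectRAM, 1 cycle @ 360 MHz", selectram),
]:
    print(f"{name}: {value:.2f} ns")
```

Under this simplification, the higher-frequency M512 mode has the longest total read latency of the three, because the second cycle outweighs the shorter clock period.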

Because small RAMs are often used as data storage for small FIFOs, coefficient storage for DSP filters, buffers for packet processing, and other applications, having maximum performance in this structure can often enable designers to meet their system performance requirements.

Block RAMs
As most designs typically use a majority of the RAM memory available on the device, Stratix users are forced to use the MegaRAM memory blocks to create their desired functionality. For the wide (4k x 144) and deep (64k x 8) configuration of the MegaRAM, we evaluated the read/write performance of Virtex-II Pro block RAM configured to the same width and depths as the Stratix MegaRAM memory. The results, as presented in Table 1, show that for the deep and wide configuration with one clock delay, the memory read time performance in Virtex-II Pro FPGAs is approximately 40% and 95% faster than Stratix FPGAs, respectively.

Table 1 – Virtex-II Pro(-7) and Stratix(-5) block RAM performance

                                                Write Speed [MHz]          Read Speed [MHz]
Configuration                     Clock Delays  Stratix  Virtex-II Pro     Stratix  Virtex-II Pro
Deep Single-Port Memory 64k x 8        1          287        282             199        282
                                       2          287        282             287        282
Wide Single-Port Memory 4k x 144       1          255        284             145        282
                                       2          255        287             255        287
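
The read-speed advantages quoted for the one-clock-delay case can be recomputed from the Table 1 figures. The sketch below only restates the table values; the dictionary layout and function name are illustrative choices.

```python
# (Stratix MHz, Virtex-II Pro MHz) pairs taken from Table 1.
TABLE_1 = {
    ("deep 64k x 8", 1): {"write": (287, 282), "read": (199, 282)},
    ("deep 64k x 8", 2): {"write": (287, 282), "read": (287, 282)},
    ("wide 4k x 144", 1): {"write": (255, 284), "read": (145, 282)},
    ("wide 4k x 144", 2): {"write": (255, 287), "read": (255, 287)},
}

def advantage_percent(stratix_mhz, virtex_mhz):
    """Virtex-II Pro speed advantage over Stratix, in percent."""
    return (virtex_mhz / stratix_mhz - 1.0) * 100.0

deep_read = advantage_percent(*TABLE_1[("deep 64k x 8", 1)]["read"])
wide_read = advantage_percent(*TABLE_1[("wide 4k x 144", 1)]["read"])

print(f"deep, 1 clock delay: {deep_read:.1f}% faster read")
print(f"wide, 1 clock delay: {wide_read:.1f}% faster read")
```

This gives roughly 41.7% for the deep configuration and 94.5% for the wide one, in line with the approximate 40% and 95% figures in the text. The same function applied to the other rows shows much smaller differences, including a slight Stratix lead for the deep configuration.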

The wide MegaRAM configuration has approximately 300 signals that need to be connected to the relatively small footprint of the memory block. This leads to registers and logic competing for optimal placement locations of a few sites in the array closest to these memory pins. The additional routing congestion of these signals impacts overall memory performance.

Because the Virtex-II Pro configuration was created using smaller RAMs spread out over a greater area of the chip, a more optimal placement and routing could be realized, resulting in higher performance.

Multiply and Accumulate
Because Stratix devices contain a dedicated DSP block, it is often assumed that this block can outperform the same function created in a Virtex-II Pro device. Figure 3 highlights the maximum performance, with latency, for two popular implementation sizes of a multiply and accumulate (MAC) function: 9 x 9 and 18 x 18. This analysis shows that Virtex-II Pro devices have faster performance than Stratix devices for the MAC function.

Software Features
The FPGA fabric feature set continues to offer capabilities that improve design performance and reduce area. For users to realize these benefits, the software tools – both synthesis and place and route – need to use these architecture capabilities.

Synthesis
FPGA-centric synthesis tools constantly look for new optimization techniques that go beyond mere LUT mapping. These synthesis tools can extract known functions such as arithmetic functions, memories, and multiplexers by parsing the RTL code, automatically mapping these functions to features on the target architecture.

Synthesis mapping to the MUXF, MULT_AND, and SRL are examples of synthesis tools providing architecture-specific mapping to reduce logic levels on the critical paths, as well as reducing placement and routing congestion, thereby improving overall design performance. Synthesis tools will also automatically infer either the LUT RAM or block RAM based on the coding style and the size of memory being used. For example, the Synplicity® Synplify® software tool may infer fast LUT RAMs for as much as 2k of memory.

As FPGAs go deeper into sub-micron technologies, routing delays become more predominant, and design performance is highly influenced by cell placement. Thus, Xilinx provides detailed timing estimates to enable synthesis tools to not only select the best architecture element for the implementation, but also to improve timing predictability between post-synthesis and post-layout. This close technical collaboration ensures that synthesis optimization is focused on the path that is critical to place and route.

Place and Route
A study done by researchers at UCLA showed that timing-driven placement algorithms for FPGAs can average 30% off from optimal results. The study also found that Xilinx tools do much better than other tools in the industry. For instance, the delay generated by the Xilinx ISE placer was only 8.3% worse than optimal and only 4.1% worse after routing.

To illustrate this advantage, we compiled the “blowfish” encryption algorithm, an open source design, using ISE 6.2i and Altera Quartus™ 3.0 targeting Virtex-II Pro(-7) and Stratix(-5) devices, respectively. Figure 4 represents the breakdown of logic and route delay for the critical path.

This analysis shows that ISE placement technology is able to provide near-optimal placement, resulting in an 80:20 logic:route delay ratio for Virtex-II Pro FPGAs, whereas the Stratix implementation using Quartus leads to a 50:50 logic:route delay ratio. As a result, the design is two times faster when implemented in a Virtex-II Pro device.

Timing-driven map technology, new in ISE 6 software, is just one example of years of Xilinx expertise in place and route for segmented architectures. This technology enables the mapper to iterate between map and place, as shown in Figure 5, such that the placer can provide the mapper with suggested slice-level primitive mapping. This iterative loop leads to near-optimal slice mapping and placement, resulting in improved timing, because the router can now pick the best route with fewer conflicts for the same routing resources.

Critical Settings
The performance graphs in Figure 1 show that the Stratix device outperformed the Virtex-II Pro device in one design. This is because our analysis uses default settings in synthesis, with pipelining “off.” Because the design had a multiply function on the critical path, the Stratix design had an instantiated pipelined lpm (library of parameterized modules) multiplier, a black-box function generated by the Quartus MegaWizard. For the Virtex-II Pro design, synthesis inferred the MULT18x18 primitive.

By changing pipelining to “on,” the synthesis tool inferred a MULT18x18S primitive for Virtex-II Pro FPGAs, resulting in an implementation with faster performance compared to Stratix FPGAs. So, in real-world designs, you’ll see that Virtex-II Pro devices almost always outperform Stratix devices.

Conclusion
Advanced architecture features, such as MUXFs, SRLs, MULT_ANDs, fast SelectRAM and block RAM solutions, and fast dedicated multipliers contribute significantly to the performance advantage of Virtex-II Pro devices over Stratix devices.

The combination of an advanced architecture, the synthesis tool’s capability to access architecture-specific features, and the place and route software’s ability to deliver near-optimal placement for a segmented architecture results in Virtex-II Pro FPGAs having a 40% average performance advantage over Stratix PLDs.

In most cases, the fastest Stratix speed grade must be used to realize the performance of the slowest Virtex-II Pro speed grade. A Stratix device in any speed grade cannot match the performance seen in the faster speed grades of Virtex-II Pro devices. Virtex-II Pro FPGAs reach a new level of performance not matched by any other FPGA in the industry today.

© 1994-2008 Xilinx, Inc. All Rights Reserved.