As programmable logic devices increase in
density and complexity, the combination
of a feature-rich fabric and sophisticated
design tools enables users to realize their
performance goals faster. Shorter design
cycle times also enable users to lower overall
design costs and meet time-to-market
requirements.
From an analysis of 50 customer designs,
we determined that Xilinx Virtex-II Pro
FPGAs enjoy an average 40% performance
advantage over their nearest competitor,
Altera Stratix FPGAs. Across designs with densities
ranging from 200,000 to 6 million
system gates, the Virtex-II Pro device was
as much as 123% faster than the Stratix
device.
Figure 1 shows the performance
advantage distribution.
This article highlights how Virtex-II
Pro FPGAs, along with ISE 6 design
tools, provide a 40% performance advantage
when compared to Stratix FPGAs.
Architectural Features
The basic building block in the Stratix
architecture is called a logic element (LE).
An LE contains three functional structures:
a four-input look-up table (LUT), a register,
and a carry chain.
Virtex-II Pro architecture not only
includes the structures found in an LE, but
also additional functionality, such as a
function expander (MUXF), a
MULT_AND arithmetic cell, and a more
logic-rich carry structure.
Furthermore, the Virtex LUT can be
used as a 16-bit shift register or as a single- or
dual-port RAM element. These additional
features in the Virtex-II Pro architecture
enable users to realize higher
design performance, as we describe in
the following sections.
MUXF Function Expander
One of the primary factors impacting circuit
performance in FPGAs is the number of logic levels
in the signal path. The function expander
cell represents a 2:1 MUX, which can be
used to build functions wider than four
inputs without the need for additional
LUT logic levels.
For example, using the MUXF, only
four LUTs are required to implement an
8:1 mux in a single LUT logic level. That
same 8:1 mux in the Stratix PLD is implemented
using five LUTs and the implementation
is two LUT logic levels. The
additional LUT logic level adds delay to
the signal path.
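The single-level 8:1 mux described above can be modeled in a few lines of Python. This is a behavioral sketch rather than vendor code: the helper names (lut4, muxf, mux8) are hypothetical, and the arrangement of one level of four LUTs followed by two stages of dedicated 2:1 function expanders is an assumption about how the mapping is organized.

```python
from itertools import product

def lut4(table, a, b, c, d):
    """Four-input LUT: look up a 16-entry truth table by the input bits."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]

def muxf(in0, in1, sel):
    """Dedicated 2:1 function expander; it consumes no LUT."""
    return in1 if sel else in0

# Truth table of a 2:1 mux: a and b are data, c is the select, d is unused.
MUX2 = [b if c else a for d, c, b, a in product((0, 1), repeat=4)]

def mux8(data, s0, s1, s2):
    """8:1 mux built from four LUTs (one LUT level) plus function expanders."""
    # Single LUT logic level: each LUT picks between two adjacent data inputs.
    luts = [lut4(MUX2, data[2 * i], data[2 * i + 1], s0, 0) for i in range(4)]
    # Expander muxes combine the LUT outputs without adding a LUT level.
    lower = muxf(luts[0], luts[1], s1)
    upper = muxf(luts[2], luts[3], s1)
    return muxf(lower, upper, s2)
```

In this model every data input passes through exactly one LUT before reaching the output, which is the property the text attributes to the MUXF-based implementation.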
The function expander is not limited to
multiplexers; it can be used for many other
logic functions. For example, a MUXF
combined with two LUTs can implement
any function of five inputs, thereby implementing
a full five-input LUT in a single
LUT logic level. A Stratix implementation
would require two or three LUTs, depending
on the function, and would be implemented
in two LUT logic levels.
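The claim that two LUTs plus one MUXF cover any five-input function can be checked with a small model. The sketch below assumes the fifth input drives the expander's select, so each LUT holds one half of the 32-entry truth table; the function names are hypothetical and chosen only for illustration.

```python
def lut4(table, a, b, c, d):
    """Four-input LUT: look up a 16-entry truth table by the input bits."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]

def muxf(in0, in1, sel):
    """Dedicated 2:1 function expander."""
    return in1 if sel else in0

def split_five_input(table32):
    """Split a 32-entry truth table into the LUT contents for e=0 and e=1."""
    return table32[:16], table32[16:]

def eval_five_input(table32, a, b, c, d, e):
    """Evaluate a five-input function using two LUT4s and one expander."""
    low, high = split_five_input(table32)
    return muxf(lut4(low, a, b, c, d), lut4(high, a, b, c, d), e)

# Example function: five-input parity, indexed as a | b<<1 | c<<2 | d<<3 | e<<4.
PARITY5 = [bin(i).count("1") & 1 for i in range(32)]
```

Because the split works for an arbitrary 32-entry table, it does not depend on the function being a multiplexer.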
Figure 2 shows a nine-input function
mapped onto two LUTs (plus one function
expander for the Virtex-II Pro architecture).
The same function requires three
LUTs for the Stratix device and two LUT
logic levels, as opposed to a single LUT
logic level for a Virtex-II Pro device.
The MUXFx component is like having a
five- or six-input LUT. This leads to fewer
logic levels and also far fewer LUTs consumed
(10% on average) than for the same
function in Stratix FPGAs. This results in
higher performance for Virtex-II Pro designs
because fewer logic levels are generally
required for critical paths. At the same time,
less placement and routing congestion occurs
because 10% fewer resources (LUTs) are necessary
to build the same functionality.
Shift Register LUT
A LUT in shift register mode (SRL) can
implement a selectable 16-bit shift register
in a single LUT. The same shift register in a
Stratix device would be implemented either with
16 flip-flops and as many as 10 LUTs or with a
memory block, both of which are much less flexible.
In a Stratix PLD, if the shift register cannot
be implemented in a memory block, a
16-bit shift register implemented using 16
LEs creates added routing congestion that
may impact design performance. If the shift
register requires variable tap selection, this
will add logic levels on the output path,
resulting in much slower operation.
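A behavioral sketch of the selectable-tap shift register may make the comparison concrete. The class name and the convention that address 0 returns the most recently shifted-in bit are assumptions made for illustration; the model describes behavior only and says nothing about timing.

```python
class SRL16:
    """Behavioral model of a 16-bit shift register with a selectable tap."""

    def __init__(self):
        # bits[0] holds the most recently shifted-in value.
        self.bits = [0] * 16

    def shift(self, din):
        """Clock in one bit; the oldest bit falls off the end."""
        self.bits = [din & 1] + self.bits[:-1]

    def read(self, addr):
        """Return the bit shifted in addr + 1 clocks ago (addr in 0..15)."""
        return self.bits[addr & 0xF]
```

Here the tap selection is just an index into the stored bits, whereas a register chain built from discrete flip-flops needs a separate output multiplexer to choose a tap.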
MULT_AND
The MULT_AND arithmetic cell is commonly
used in soft multiplication applications.
However, the flexibility of the FPGA
fabric also allows the cell to help map some
five-input functions onto a single LUT. For example,
loadable up and down counters implemented
using the MULT_AND function
use only one LUT per bit instead of the two
LUTs per bit required in Stratix PLDs. This implementation can result in as much as
30% faster performance in Virtex-II Pro
FPGAs because of the fewer logic levels
and fewer required LUTs.
LUT-based RAMs
A LUT may also be configured as a single- or
dual-port RAM, resulting in very fast
read and write access for smaller data storing
and buffering applications. In Stratix
devices, the smallest RAM configuration
(the M512 blocks) offers much slower
RAM operation and less flexible dual-port
access, while at the same time requiring
greater latency for reads.
The maximum read speeds for the
M512 RAMs are 266 MHz for one-clock
cycle reads and 320 MHz for two-clock
cycle latency, while the Virtex-II Pro
SelectRAM memory allows 360 MHz
read operation with a single clock latency,
as well as asynchronous read capability for
low-latency design requirements.
Because small RAMs are often used as
data storage for small FIFOs, coefficient
storage for DSP filters, buffers for packet
processing, and other applications, having
maximum performance in this structure
can often enable designers to meet their
system performance requirements.
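The difference between a clocked write and an asynchronous read can be illustrated with a short behavioral model. The 16 x 1 organization is an assumption based on the 16 entries of a four-input LUT, and the class and method names are hypothetical.

```python
class LutRam16x1:
    """Behavioral model of a LUT used as a 16 x 1 RAM."""

    def __init__(self):
        self.mem = [0] * 16

    def read(self, addr):
        """Asynchronous read: the stored bit is available without a clock edge."""
        return self.mem[addr & 0xF]

    def clock(self, write_enable, addr, din):
        """Synchronous write: memory changes only when this clock edge is applied."""
        if write_enable:
            self.mem[addr & 0xF] = din & 1
```

Calling read with an address different from the one passed to clock approximates dual-port use, where the read port is addressed independently of the write port.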
Block RAMs
As most designs typically use a majority of
the RAM memory available on the device,
Stratix users are forced to use the MegaRAM
memory blocks to create their desired functionality.
For the wide (4k x 144) and deep
(64k x 8) configuration of the MegaRAM,
we evaluated the read/write performance of
Virtex-II Pro block RAM configured to the
same width and depths as the Stratix
MegaRAM memory. The results, as presented
in Table 1, show that for the deep and
wide configuration with one clock delay, the
memory read time performance in Virtex-II
Pro FPGAs is approximately 40% and 95%
faster than Stratix FPGAs, respectively.
Table 1: Virtex-II Pro (-7) and Stratix (-5) block RAM performance
| Configuration | Clock Delays | Write Speed, Stratix [MHz] | Write Speed, Virtex-II Pro [MHz] | Read Speed, Stratix [MHz] | Read Speed, Virtex-II Pro [MHz] |
| Deep Single-Port Memory 64k x 8 | 1 | 287 | 282 | 199 | 282 |
| Deep Single-Port Memory 64k x 8 | 2 | 287 | 282 | 287 | 282 |
| Wide Single-Port Memory 4k x 144 | 1 | 255 | 284 | 145 | 282 |
| Wide Single-Port Memory 4k x 144 | 2 | 255 | 287 | 255 | 287 |
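The approximate 40% and 95% read-speed advantages quoted for the one-clock-delay case follow directly from the Table 1 values. The short calculation below reproduces them; the dictionary layout and the function name are illustrative only.

```python
# Read speeds in MHz with one clock delay, copied from Table 1.
READ_MHZ = {
    "deep 64k x 8": {"stratix": 199, "virtex2pro": 282},
    "wide 4k x 144": {"stratix": 145, "virtex2pro": 282},
}

def speedup_percent(fast_mhz, slow_mhz):
    """Percentage by which fast_mhz exceeds slow_mhz."""
    return (fast_mhz / slow_mhz - 1.0) * 100.0

for name, mhz in READ_MHZ.items():
    gain = speedup_percent(mhz["virtex2pro"], mhz["stratix"])
    print(f"{name}: {gain:.1f}% faster")
# deep 64k x 8: 41.7% faster
# wide 4k x 144: 94.5% faster
```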
The wide MegaRAM configuration has
approximately 300 signals that need to be
connected to the relatively small footprint
of the memory block. This leads to registers
and logic competing for optimal
placement locations of a few sites in the
array closest to these memory pins. The
additional routing congestion of these signals
impacts overall memory performance.
Because the Virtex-II Pro configuration
was created using smaller RAMs spread out
over a greater area of the chip, a more optimal
placement and routing could be realized,
resulting in higher performance.
Multiply and Accumulate
Because Stratix devices contain a dedicated DSP
block, it is often assumed that a Stratix device can outperform
the same function created in a Virtex-II Pro device. Figure 3 highlights the
maximum performance, with latency, for
two popular implementation sizes of
a multiply and accumulate (MAC): 9 x 9
and 18 x 18. This analysis shows that Virtex-II Pro devices have faster performance than
Stratix devices for the MAC function.
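For readers unfamiliar with the function being benchmarked, a MAC multiplies two operands and adds the product to a running sum. The sketch below is a behavioral illustration only: the signed two's-complement interpretation and the 48-bit accumulator width are assumptions, not details taken from either device.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac(pairs, operand_bits=18, acc_bits=48):
    """Multiply each (a, b) pair and accumulate the products.

    operand_bits selects the 9 x 9 or 18 x 18 case; acc_bits is a
    hypothetical accumulator width chosen for this example.
    """
    acc = 0
    for a, b in pairs:
        product = to_signed(a, operand_bits) * to_signed(b, operand_bits)
        acc = to_signed(acc + product, acc_bits)
    return acc
```

For example, mac([(3, 4), (5, 6)]) returns 42, and passing operand_bits=9 models the 9 x 9 case.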
Software Features
The FPGA fabric feature set continues to
offer capabilities that improve design performance
and reduce area. For users to
realize these benefits, the software tools
(both synthesis and place and route) need
to use these architecture capabilities.
Synthesis
FPGA-centric synthesis tools constantly
look for new optimization techniques that
go beyond mere LUT mapping. These
synthesis tools can extract known functions
such as arithmetic functions, memories,
and multiplexers by parsing the RTL
code, automatically mapping these functions
to features on the target architecture.
Mapping to the MUXF,
MULT_AND, and SRL is an example of synthesis
tools providing architecture-specific mapping that reduces logic levels on the critical
paths and relieves placement
and routing congestion, thereby improving
overall design performance. Synthesis tools
will also automatically infer either the LUT
RAM or block RAM based on the coding
style and the size of memory being used.
For example, the Synplicity® Synplify®
software tool may infer fast LUT RAMs for
as much as 2k of memory.
As FPGAs go deeper into sub-micron
technologies, routing delays become more
predominant, and design performance is
highly influenced by cell placement. Thus,
Xilinx provides detailed timing estimates
to enable synthesis tools to not only select
the best architecture element for the
implementation, but also to improve timing
predictability between post-synthesis
and post-layout. This close technical collaboration
ensures that synthesis optimization
is focused on the path that is critical
to place and route.
Place and Route
A study done by researchers at UCLA
showed that timing-driven placement
algorithms for FPGAs can average 30% off
from optimal results. The study also found
that Xilinx tools do much better than
other tools in the industry. For instance,
the delay generated by the Xilinx ISE placer
was only 8.3% worse than optimal and
only 4.1% worse after routing.
To illustrate this advantage, we compiled
the blowfish encryption algorithm,
an open source design, using ISE 6.2i and
Altera Quartus 3.0 targeting Virtex-II
Pro(-7) and Stratix(-5) devices, respectively.
Figure 4 represents the breakdown of
logic and route delay for the critical path.
This analysis shows that ISE placement
technology is able to provide near-optimal
placement, resulting in an 80:20
logic:route delay ratio for Virtex-II Pro
FPGAs, whereas the Stratix implementation
using Quartus leads to a 50:50
logic:route delay ratio. As a result, the
design is two times faster when implemented
in a Virtex-II Pro device.
Timing-driven map technology, new
in ISE 6 software, is just one example of
years of Xilinx expertise in place and
route for segmented architectures. This
technology enables the mapper to iterate
between map and place, as shown in
Figure 5, such that the placer can provide
the mapper with suggested slice-level
primitive mapping. This iterative loop
leads to near-optimal slice mapping and
placement, resulting in improved timing,
because the router can now pick the best
route with fewer conflicts for the same
routing resources.
Critical Settings
The performance graphs in Figure 1 show
that the Stratix device outperformed the
Virtex-II Pro device in one design. This is
because our analysis uses default settings
in synthesis, with pipelining off.
Because the design had a multiply function
on the critical path, the Stratix design
had an instantiated pipelined lpm (library
of parameterized modules) multiplier, a
black-box function generated by the
Quartus MegaWizard. For the Virtex-II
Pro design, synthesis inferred the
MULT18x18 primitive.
By changing pipelining to on, the
synthesis tool inferred a MULT18x18S
primitive for Virtex-II Pro FPGAs,
resulting in an implementation with
faster performance compared to Stratix
FPGAs. So, in real-world designs, you'll
see that Virtex-II Pro devices almost
always outperform Stratix devices.
Conclusion
Advanced architecture features, such as
MUXFs, SRLs, MULT_ANDs, fast
SelectRAM and block RAM solutions, and
fast dedicated multipliers contribute significantly
to the performance advantage of
Virtex-II Pro devices over Stratix devices.
The combination of an advanced
architecture, the synthesis tool's capability
to access architecture-specific features,
and the place and route software's ability
to deliver near-optimal placement for a
segmented architecture results in Virtex-II
Pro FPGAs having a 40% average performance
advantage over Stratix PLDs.
In most cases, the fastest Stratix speed
grade must be used to realize the performance
of the slowest Virtex-II Pro
speed grade. A Stratix device in any speed
grade cannot match the performance seen
in the faster speed grades of Virtex-II Pro
devices. Virtex-II Pro FPGAs reach a new
level of performance not matched by any
other FPGA in the industry today.
Printable PDF version of this article with graphics. (3/15/04) 256 KB