AT40K-FFT [ETC]

AT40K-FFT [Updated 8/98. 8 Pages] Fast fourier transform Intellectual Property Core for AT40K FPGAs ; AT40K -FFT [更新8/98 。 8页]快速傅立叶变换知识产权为核心的FPGA AT40K\n
AT40K-FFT
型号: AT40K-FFT
厂家: ETC    ETC
描述:

AT40K-FFT [Updated 8/98. 8 Pages] Fast fourier transform Intellectual Property Core for AT40K FPGAs
AT40K -FFT [更新8/98 。 8页]快速傅立叶变换知识产权为核心的FPGA AT40K\n

文件: 总8页 (文件大小:48K)
中文:  中文翻译
下载:  下载PDF数据表文档文件
Features  
Decimation in frequency radix-2 FFT algorithm.  
256-point transform.  
12-bit fixed point arithmetic.  
Fixed scaling to avoid numeric overflow.  
Requires no external memory, i.e. uses on chip RAM and ROM.  
External access to on-chip RAM for data IO.  
Clock speed of 21 MHz.  
98 ms processing time.  
70% utilization of AT40K30 logic resources.  
48% utilization of AT40K30 RAM resources.  
AT40K FPGA  
IP Core  
Description  
The Fast Fourier Transform (FFT) processor is a FFT engine developed for the  
AT40K family of Field Programmable Gate Arrays (FPGAs). The design is based on a  
decimation-in-frequency radix-2 algorithm and employs in-place computation to opti-  
mize memory usage. In order to operate the processor, data must first be loaded into  
the internal RAM. The processor is then instructed to compute the FFT, overwriting  
the input data in the RAM with the results. Upon completion of the FFT, the results  
may be read out from the RAM via the output data port.  
AT40K-FFT  
Rev. 1132A–08/98  
Operation and Interfacing  
Figure 1. Internal architecture of the FFT processor.  
Address  
Counter  
Address  
Counter  
Data RAM  
8
8
Data input  
Data output  
port  
12  
12  
12  
port  
12  
Butterfly  
12  
Address  
Counter  
Twiddle  
Factor ROM  
8
Control  
Control port  
Figure 1 shows the architecture of the FFT processor. The  
design is based on the radix-2 decimation in frequency  
algorithm and employs in-place computation to optimize  
memory usage.  
To operate the processor data must first be loaded into the  
internal RAM. The processor is then instructed to compute  
the FFT, overwriting the input data in the RAM with the  
results. On completion of the FFT the results can be read  
out from the RAM via the output data port.  
Figure 2. Example FFT processor configuration 1.  
Real  
input  
Address  
8
12  
12  
12  
12  
FIFO  
FIFO  
FFT  
Processor  
µ
Proc  
Data  
24  
Imag  
input  
Input  
Clock  
Processor  
clock  
Figure 3. Example FFT processor configuration 2.  
Address  
8
FFT  
Processor  
µ
Proc  
Data  
24  
Processor  
clock  
AT40K-FFT  
2
AT40K-FFT  
Interfacing requirements are likely to be highly application  
specific. The FFT processor has therefore been designed  
so that users may readily customize the data IO interface to  
meet their requirements. Two examples of likely interfacing  
scenarios are depicted in Figure 2 and Figure 3. In the first,  
the processor is used to perform transforms on real time  
data generated by a pair of ADC’s. This data can either be  
buffered with a FIFO, or be fed directly into the processor,  
depending on whether “end to end” transforms are required  
or not. The results from the FFT processor are then read  
out to a microprocessor for further processing. In the sec-  
ond scenario both the input and output ports of the FFT  
processor are mapped into the memory space of a micro-  
processor. In this configuration the FFT processor can be  
used as a co-processor to the microprocessor.  
further information on the design the reader is referred to  
the design source files, especially the schematics.  
The reader is assumed to be familiar with the various FFT  
algorithms and their classification. General background  
information on the FFT algorithm can be found in [1], [2]  
and [3]. Information on hardware implementations of the  
FFT can be found in [4].  
Butterfly  
This function implements the radix-2 decimation in fre-  
quency butterfly described by the following equations:  
A’=(A+B)/2  
B’=(A-B)xWT/2  
where A and B are the complex inputs to the butterfly, A’  
and B’ are the complex outputs and WT is the complex  
twiddle factor. The divide by two is not normally found in  
the butterfly equation, but is included here to prevent  
numeric overflow in the fixed point implementation.  
The FFT processor uses fractional 12-bit fixed point arith-  
metic. To prevent internal overflows occurring the proces-  
sor scales all intermediate results by ½, thus for a 256 point  
transform the output data is scaled by 1/256. Furthermore,  
to ensure that no overflows occur, the input data must  
observe the following rules:  
Analysis of this function indicates that it requires the follow-  
ing logical operations.  
• 2 complex data fetches  
|Re(Input)+jIm(Input)| < 1  
or  
• 2 complex data stores  
• 1 complex twiddle factor fetch  
• 1 complex addition (2 signed adders)  
• 1 complex subtraction (2 signed subtracters)  
-1Re(Input)<1, Im(Input)=0  
where Re(Input) and Im(Input) are the real and imaginary  
components of the input data.  
• 1 complex multiply (4 signed multipliers and 3 signed  
adders)  
Detailed Design Description  
The architecture of the design is depicted in Figure 1. From  
this it can be seen that the design contains the following  
components.  
Unfortunately it is impossible to execute all these functions  
in a single cycle for the following two reasons. Firstly, the  
dual ported memory bank is incapable of supporting more  
than 1 data write and 1 data read per cycle, and secondly  
the implementation of 4 medium precision array multipliers  
would exceed the logic resources of a single AT40K FPGA.  
• Butterfly  
Twiddle factor ROM  
• Data RAM  
These problems can be overcome if the butterfly calcula-  
tion is performed over two clock cycles. This balances the  
memory and butterfly IO requirements and also halves the  
number of multipliers required, permitting the design to fit  
on a single chip.  
• Address generators  
• Controller  
The design of each of these components is discussed in  
turn, starting with the butterfly. Following this the design  
entry, layout and simulation strategies are described. For  
3
Figure 4. Architecture of the butterfly.  
12  
12  
12  
12  
A
A
B
A'  
A'  
B'  
( )  
Re  
Im  
(
(
), Re  
(
)
Delay  
Re  
Im  
(
(
), Re  
), Im  
12  
12  
12  
12  
B
B'  
)
), Im  
(
)
Delay  
(
13  
13  
14  
12  
Cross Bar  
13  
13  
14  
12  
Im  
(
T), -Re  
(
W
W T  
)
Figure 4 shows the internal architecture of the 2 clock cycle  
butterfly. From the input the data flow splits into two main  
paths. The upper one computes A’=(A+B)/2 and the lower  
one B’=(A-B)xWT/2. The results from these two data paths  
are then multiplexed onto the output data bus using tristate  
buffers.  
Im(B’)=(-Re(-A+B)xIm(WT)+Im(-A+B)x-  
Re(WT))/2=Im((A-B)xWT).  
To improve performance the butterfly is more heavily pipe-  
lined than is shown in Figure 4. This results in an input to  
output latency for the butterfly of 9 cycles.  
Twiddle factor ROM  
In upper path the input data is fed into an accumulator. This  
sums it every two cycles, divides the result by two and  
rounds to 12 bits, giving A’=(A+B)/2. The result is then sent  
through a short register based delay to align it with the  
result from the slightly longer lower data path used to com-  
pute B’.  
The twiddle factor ROM is organized as 256 x 12 bit words.  
Each of the 128 complex twiddle factors is stored in the  
ROM with alternate words being used for the real and  
imaginary components.  
To improve the accuracy of the most common twiddle fac-  
tor, i.e. W0=1, the real components of the twiddle factors  
are negated in the ROM. Thus W0 is stored as –1 rather  
than 0.995; 0.995 being the closest approximation to 1 per-  
mitted by a 12 bit two’s complement fractional integer.  
In the B’ data path the input data is fed into a differencing  
circuit which computes –A+B. The real and imaginary 13 bit  
results are then fed through a cross bar switch which pre-  
sents the difference of the imaginary components (-  
Im(A)+Im(B)) to both 13x12 bit signed multipliers in the first  
cycle, and the difference of the real components (-  
Re(A)+Re(B)) in the second. In the first cycle both multipli-  
ers multiply by the imaginary component of the twiddle fac-  
tor (Im(WT)) and in the second by the negated real  
component (-Re(WT)). (Note that the reason for negating  
the real component of the twiddle factor is discussed in  
section Twiddle factor ROM.)  
The twiddle factor ROM was generated using Atmel’s  
macro generator. The input data file for the macro genera-  
tor was generated using a MatLab script.  
Data RAM  
The data RAM is configured as two 256 x 12 bit dual ported  
memories. One is used to hold the real components and  
the other to hold the imaginary components.  
The availability of dual ported, as opposed to single ported,  
memory significantly improves the utilization of on chip  
RAM. Use of single ported RAM would have required twice  
as much memory to permit the concurrent data fetches and  
stores required by the butterfly.  
The outputs of the two multipliers are therefore Im(-  
A+B)xIm(WT), Re(-A+B)x-Re(WT) and Re(-A+B)xIm(WT),  
Im(-A+B)x-Re(WT). The 25 bit results from the multipliers  
are then truncated to 14 bits. One data stream is fed into an  
accumulator which sums the results, divides by two and  
truncates the result to 12 bits with rounding, giving,  
To achieve maximum memory efficiency the FFT algorithm  
uses “in place” computation, i.e. the data stored after each  
butterfly computation is placed back in the same locations  
that the input data was read from. This requirement has no  
impact on performance, but does somewhat complicate the  
design of the data address generators.  
Re(B’)=(Im(-A+B)xIm(WT)+Re(-A+B)x  
Re(WT))/2=Re((A-B)xWT).  
The other is fed into a differencing circuit which subtracts  
the two results, divides by two and truncates the result to  
12 bits with rounding, yielding,  
AT40K-FFT  
4
AT40K-FFT  
Address Generators  
Design Capture and Layout  
The design employs two address generators to control the  
input and output of data from the data RAM. One is used to  
generate the read addresses and the other the write  
addresses. Both produce the same address pattern,  
though they are skewed in phase by 9 cycles, i.e. the  
length of the butterfly pipeline.  
The design was entered via schematic capture using Work-  
view Office. This design method was employed to provide  
good design visibility to users.  
Extensive use was made of user macros to ensure high  
performance layouts for sub components.  
The device targeted for the design was the AT40K30. Utili-  
zation of the part was as follows:  
The data address generators are based on an 8 bit accu-  
mulator where the carry out signal is connected to the carry  
in signal. The accumulator increment is provide by a  
Johnson counter which increments every 256 cycles. The  
data address generators feature tristate outputs to discon-  
nect them from the address busses while data is trans-  
ferred into or out of the chip.  
Logic Cells: 1114/1600 = 69.6%  
RAM Cells: 48/100 = 48%  
Design Simulation  
The design was simulated using the gate level simulator  
ViewSim. Both functional and post layout simulations were  
performed to confirm the correct operation of the design.  
The twiddle factor address generator produces a different  
address sequence. Here an 8 bit binary counter is  
employed, the output bits of which are ANDed with a mask-  
ing function. This masking function is provided by an 8-bit  
shift register which steps through the following pattern  
00000001b, 00000011b, 00000111b, etc., once every 256  
cycles.  
A MatLab script was employed to generated the input data  
files for the simulation. This same script was also used to  
read the results files from the simulation and check them  
against the MatLab FFT function.  
Timing Analysis  
Timing analysis of the design indicated a maximum clock  
speed of 21.2 MHz when using the AT40K30 part. Investi-  
gations into the limiting data path revealed this to be the  
delay from the twiddle factor address counter output,  
through the asynchronous twiddle factor ROM to the butter-  
fly’s twiddle factor input. Clearly, conversion of the twiddle  
ROM from asynchronous to synchronous operation would  
be a starting point for improving the design’s performance.  
Controller  
The controller synchronizes the operation of all the other  
components. It contains a state machine closely coupled to  
a counter to generate all the control signals required.  
The state machine provides a simple two line control port  
(START and BUSY) to enable external devices to initiate  
processing and monitor its completion.  
5
Design Analysis  
System performance could potentially be improved by pro-  
viding suitable buffering to permit concurrent data process-  
ing and IO. Double buffering designs would most probably  
have to use external memory devices.  
Design Performance  
The design requires Nlog2N clock cycles to compute the  
FFT, where N is the transform length. In this design N is  
fixed at 256, requiring a total of 2048 clock cycles. Timing  
analysis of the design (see section Timing Analysis) indi-  
cates a maximum clock rate of 21 MHz. This gives a total  
processing time for the FFT of 98 µs.  
The only technique to radically improve performance is to  
consider other FFT architectures, probably involving multi-  
ple FPGAs. The most probable architecture is the “pipeline  
FFT” processor described in [4]. This requires log2N butter-  
flies arranged in a line and interspersed with delay lines.  
Data is then fed continuously through the pipeline to com-  
pute the FFT. Using the current butterfly design such an  
FFT processor would require multiple FPGAs, each con-  
taining one or two butterflies and their associated line  
delays. For an transform length of N such an architecture  
would produce an log2N fold performance increase com-  
pared to the current design, (i.e. an 8 times improvement  
for a 256 point FFT.) An alternative strategy would be to  
attempt to reduce the complexity of the butterfly by using  
bit serial arithmetic, thus permitting more butterflies to be  
implemented on a single FPGA.  
It should be noted that the design does not support concur-  
rent data IO and processing. Consequently, in real applica-  
tions the user must also allow time to transfer data in and  
out. 256 clock cycles are required to write data into the  
internal RAM, yielding a minimum data input time of 12 µs  
at 21 MHz. Determination of the time required to read data  
out from the device is somewhat more difficult, partly  
because read accesses to the on chip RAM are asynchro-  
nous, and partly because the user may not require the  
complete output data set. However, it is not unrealistic to  
assume a similar data transfer time to the input. This gives  
a likely total data IO time of approximately 24 µs at 21 MHz.  
Suggested Techniques to Improve  
Performance  
Three main methods exist to increase the rate at which  
AT40K FPGAs can perform FFTs, namely:  
Recommendations to Improve Functionality  
Two potential improvements to the design are suggested:  
• Use of block floating point.  
• Reduction of the size of the twiddle factor ROM.  
• Increasing the design clock speed.  
• Double buffer data to hide data IO time.  
• Consider other FFT architectures.  
Currently the design uses fixed scaling by ½ at the output  
of the butterfly to prevent numeric overflow. However, this  
can significantly reduce the dynamic range of the output  
data when the input signals are weak. An alternative  
approach is to use a block floating point strategy [4]. In this  
case the scaling by ½ is only included in each FFT column  
of calculations if the results from the previous column are  
likely to cause overflow in the current column. This leads to  
an improvement in the dynamic range of the output data.  
The simplest method of increasing performance is to  
increase the clock frequency of the design. Timing analysis  
(see section Timing Analysis) indicates that the limiting  
path is through the asynchronous ROM block. Conversion  
of the ROM block to synchronous operation would improve  
performance somewhat.  
In general the clock rate of the design is limited by two fac-  
tors: the loading on the internal address and data busses  
and the speed of the multipliers. The first factor is affected  
by both the length of the transform and its precision.  
Decreasing these reduces the size of the dual ported mem-  
ory, leading to reduced bus loadings and increasing the  
data IO bandwidth between the butterfly and memory. The  
second factor is affected only by the precision of the trans-  
form. Increasing the precision increases the size of the  
multipliers, reducing the system’s performance. It should  
be noted that increasing either the data precision or the  
transform length will lead to a reduction in performance.  
For large transform lengths the use of an external dual  
ported memory should be considered, this may also pro-  
vide faster data transfer times by reducing on chip bus  
loadings.  
The additional logic required in the butterfly to implement a  
block floating point is a conditional divide by 2, i.e. a simple  
shifter. Inclusion of this function in the butterfly should nei-  
ther significantly increase its size or degrade its perfor-  
mance. In addition to changes to the butterfly some extra  
logic is required in the controller unit to operate the shifter.  
The overall size of the design could be reduced by investi-  
gating techniques to decrease the size of the twiddle factor  
ROM. Currently this stores half of the unit circle. By includ-  
ing logic to manipulate the input address and sign of the  
output data it should be possible to reduce the size of the  
ROM to store only a quarter or eighth of the unit circle. This  
is, however, only likely to be of significant benefit for large  
FFTs.  
AT40K-FFT  
6
AT40K-FFT  
7
Atmel Headquarters  
Atmel Operations  
Corporate Headquarters  
2325 Orchard Parkway  
San Jose, CA 95131  
TEL (408) 441-0311  
FAX (408) 487-2600  
Atmel Colorado Springs  
1150 E. Cheyenne Mtn. Blvd.  
Colorado Springs, CO 80906  
TEL (719) 576-3300  
FAX (719) 540-1759  
Europe  
Atmel U.K., Ltd.  
Atmel Rousset  
Zone Industrielle  
Coliseum Business Centre  
Riverside Way  
Camberley, Surrey GU15 3YL  
England  
13106 Rousset Cedex, France  
TEL (33) 4 42 53 60 00  
FAX (33) 4 42 53 60 01  
TEL (44) 1276-686677  
FAX (44) 1276-686697  
Asia  
Atmel Asia, Ltd.  
Room 1219  
Chinachem Golden Plaza  
77 Mody Road  
Tsimshatsui East  
Kowloon, Hong Kong  
TEL (852) 27219778  
FAX (852) 27221369  
Japan  
Atmel Japan K.K.  
Tonetsu Shinkawa Bldg., 9F  
1-24-8 Shinkawa  
Chuo-ku, Tokyo 104-0033  
Japan  
TEL (81) 3-3523-3551  
FAX (81) 3-3523-7581  
Fax-on-Demand  
North America:  
1-(800) 292-8635  
International:  
1-(408) 441-0732  
e-mail  
literature@atmel.com  
Web Site  
http://www.atmel.com  
BBS  
1-(408) 436-4309  
© Atmel Corporation 1998.  
Atmel Corporation makes no warranty for the use of its products, other than those expressly contained in the Company’s standard war-  
ranty which is detailed in Atmel’s Terms and Conditions located on the Company’s website. The Company assumes no responsibility for  
any errors which may appear in this document, reserves the right to change devices or specifications detailed herein at any time without  
notice, and does not make any commitment to update the information contained herein. No licenses to patents or other intellectual prop-  
erty of Atmel are granted by the Company in connection with the sale of Atmel products, expressly or by implication. Atmel’s products are  
not authorized for use as critical components in life support devices or systems.  
®
Marks bearing and/or are registered trademarks and trademarks of Atmel Corporation.  
Terms and product names in this document may be trademarks of others.  
Printed on recycled paper.  
1132A–08/98/15M  

相关型号:

AT40K-PCI

AT40K-PCI [Updated 6/01. 25 Pages] Periperals Component Interconnect Intellectual property core for AT40K FPGA
ETC

AT40K05

5K - 50K Gates Coprocessor FPGA with FreeRAM
ATMEL

AT40K05(LV)

AT40K05/10/20/40(LV) Summary [Updated 5/02. 4 Pages] 5K - 50K Gate FPGA with DSP Optimized Core Cell and Distributed FreeRam.
ETC

AT40K05-1AJC

Field Programmable Gate Array (FPGA)
ETC

AT40K05-1AJI

Field Programmable Gate Array (FPGA)
ETC

AT40K05-1AQC

Field Programmable Gate Array (FPGA)
ETC

AT40K05-1AQI

Field Programmable Gate Array (FPGA)
ETC

AT40K05-1BQC

Field Programmable Gate Array, 256-Cell, CMOS, PQFP144
ATMEL

AT40K05-1BQI

Field Programmable Gate Array (FPGA)
ETC

AT40K05-1DQC

Field Programmable Gate Array (FPGA)
ETC

AT40K05-1DQI

Field Programmable Gate Array (FPGA)
ETC

AT40K05-2AJC

5K - 50K Gates Coprocessor FPGA with FreeRAM
ATMEL