A 32 POINT MONOLITHIC FFT PROCESSOR CHIP

Guy D. Covert
Senior Systems Engineer
TRW LSI Products
La Jolla, CA 92038

CH1746-7/82/0000-1081 $00.75 © 1982 IEEE

ABSTRACT

The Discrete Fourier Transform (DFT) is used in a wide variety of digital signal processing applications. The algorithms used to implement this transform require intensive arithmetic computation as well as complex control and sequence functions. The designer of VLSI components is faced with the problem of identifying requirements and architectures for chips which directly support the DFT. Design goals of these chips include minimum chip count to implement an entire transform, very high speed and low power dissipation. This paper discusses a monolithic CMOS device that was fabricated to perform 32 point Fast Fourier Transforms at very high data rates. All data memory, arithmetic and control circuitry is contained on this single low power chip.

INTRODUCTION

The TMC2032 is a monolithic, completely self contained Fourier Transform processor which is capable of computing both forward and inverse Discrete Fourier Transforms (DFT) on 32 complex valued data samples. The device has been fabricated using a TRW proprietary 2-micron bulk CMOS process technology that offers very high circuit density and low power dissipation plus the extremely high speed operation that has previously been associated only with bipolar devices. Approximately 27,000 FET devices were used on a 236 x 248 die and the device dissipates about 900 milliwatts from a single five volt power supply. Total time required to perform a 32 point complex-to-complex DFT is 47.0 microseconds when the maximum master clock frequency of 50 MHz is used.

The algorithm implemented by the TMC2032 to compute the DFT is an in-place decimation-in-time (DIT) FFT using radix-2 butterflies. Five passes with sixteen butterflies per pass are then required for one complete 32 point transform. All input/output and arithmetic operations are performed with a sixteen bit, fractional two's complement fixed point numeric format that is common to many existing high speed digital signal processing systems. This format offers approximately 96 dB of overall dynamic range.

DEVICE ARCHITECTURE

A block diagram of the FFT processor is shown in Figure 1.

Figure 1. Block Diagram of the TMC2032

As can be seen, the arithmetic unit of the TMC2032 consists of a 16 x 16 bit Multiplier Accumulator (MAC) and a separate 17 bit carry-lookahead adder. Together, these form a one-quarter butterfly circuit that, under microprogram control, is sequenced through twenty-four cycles of the master clock to complete one full complex FFT butterfly operation every 480 nanoseconds. The MAC circuit uses a Booth coded multiply scheme. One input is connected to a sine-cosine ROM that provides the complex twiddle factors required by both the forward and inverse transforms. These factors are stored in Booth coded form so that they can be used directly by the MAC. This resulted in a significant saving of devices in the MAC circuitry at the lesser cost of requiring a 24 bit wide ROM look up table rather than a 16 bit width. Output from the quarter butterfly circuit may be right shifted up to one bit under control of an external signal. This allows scaling of data as required to prevent arithmetic overflows. Arithmetic rounding is applied to the final butterfly output by adding one half of a least significant bit to each output word.

All input/output data as well as interim results are stored in a 64 word by 16 bit RAM. This memory may be read at one address while being written at another in a single memory cycle. A memory cycle corresponds to four cycles of the master clock.
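The pass structure described above (five passes of sixteen radix-2 butterflies on sixteen bit fractional data, with an optional one bit right shift and rounding at the butterfly output) can be illustrated with a short Python model. This is a minimal sketch under stated assumptions, not the chip's microcode: the function names are hypothetical, the twiddle factors are plain 16 bit fractional values rather than the 24 bit Booth coded ROM words, the input is reordered by bit reversal before the first pass, and results that exceed the 16 bit range are saturated and flagged. The paper does not say how the device treats an overflowed word.

```python
import math

Q = 15                                   # fractional bits of a 16 bit two's complement word
MAX_Q15, MIN_Q15 = 32767, -32768

def to_q15(x):
    """Quantize a float in [-1, 1) to a 16 bit fractional integer (saturating)."""
    return max(MIN_Q15, min(MAX_Q15, int(round(x * (1 << Q)))))

def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def round_shift(v, s):
    """Arithmetic right shift by s bits after adding one half of the new LSB."""
    return (v + (1 << (s - 1))) >> s

def fft32_q15(re, im, shift_passes=()):
    """In-place radix-2 DIT FFT on 32 fractional samples (lists of ints).

    shift_passes: pass numbers (0..4) whose outputs get the extra 1 bit right shift.
    Returns one overflow flag per pass (assumed behaviour: saturate and flag).
    """
    n, passes = 32, 5
    for i in range(n):                   # assumed bit-reversed input ordering
        j = bit_reverse(i, passes)
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    overflow = []
    for p in range(passes):
        half = 1 << p
        s = Q + (1 if p in shift_passes else 0)
        flag = False
        for start in range(0, n, 2 * half):
            for k in range(half):        # 16 butterflies in total per pass
                ang = -2.0 * math.pi * k / (2 * half)
                wr, wi = to_q15(math.cos(ang)), to_q15(math.sin(ang))
                a, b = start + k, start + k + half
                tr = re[b] * wr - im[b] * wi      # W * x[b], 30 fractional bits
                ti = re[b] * wi + im[b] * wr
                ar, ai = re[a] << Q, im[a] << Q   # align x[a] to the products
                outs = [round_shift(ar + tr, s), round_shift(ai + ti, s),
                        round_shift(ar - tr, s), round_shift(ai - ti, s)]
                clipped = [max(MIN_Q15, min(MAX_Q15, v)) for v in outs]
                flag |= clipped != outs
                re[a], im[a], re[b], im[b] = clipped
        overflow.append(flag)
    return overflow
```

Calling `fft32_q15(re, im, shift_passes=range(5))` shifts on every pass, which divides the transform by 32 and can never overflow; shifting on fewer passes keeps more precision at the risk of overflow.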
All control and sequence functions in the TMC2032 are performed by a PLA based microprogrammed control unit. This unit is easily programmed by a final mask step. It cycles at the master clock rate and generates all the signals required to step through the 80 butterflies required by a 32 point transform. These signals include twiddle factor ROM addresses, RAM read/write addresses and butterfly unit states. An instruction decoder circuit allows the PLA to receive and process macro level instructions via the off-chip interface.

The off-chip interface includes separate 16 bit input and output ports, an instruction input and a status register output. All outputs have three-state buffers to give added flexibility when interfacing to bus oriented systems. Instructions to the chip include:

1. Load complex data samples over the input/output port sequentially into the dual port RAM, then perform a 32 point FFT.
2. Output complex data in bit reversed addressing order.
3. Output complex data in natural sequential order.
4. Load complex data and perform 32 point inverse FFT.
5. Right shift all data values by one bit during the next sequential pass.
6. Return status.

The status register consists of five bits. Three of these indicate which of the five FFT passes is currently in progress. The fourth bit indicates that the chip is busy and the fifth bit indicates that an arithmetic overflow has occurred during the current FFT pass.

DATA SCALING

In the implementation of any fixed point FFT, provisions must be made for scaling data points to prevent arithmetic overflows which may be caused by normal word growth within the algorithm. The TMC2032 accomplishes this scaling by use of two external signals as follows:

1. A bit is available in the status register which indicates that an arithmetic overflow occurred on the current one of the five FFT passes. This signal is reset at the beginning of each pass and latched whenever an overflow occurs.
2. An instruction may be input which causes the TMC2032 to right shift all data points by one bit before they are output from the next sequential pass. This signal is latched at the start of each pass.

The simplest application using these two signals to prevent arithmetic overflows is a fixed scaling procedure wherein an external circuit monitors the pass counter and asserts the right shift instruction in a predetermined fixed sequence. Here, the overflow bit becomes an error flag. In order to use this method effectively, some a priori knowledge about the structure of the input signal is required. For example, if the input signal is essentially Gaussian noise, the optimum fixed scaling is usually a right shift on every even numbered pass.

A more flexible approach to data scaling requires an external circuit to monitor the state of the overflow bit and determine which passes of the FFT must be right shifted in order to prevent overflows. Non-valid data will come out of the first few FFTs while the appropriate right shift pattern is developed. As long as the input signal characteristics do not change significantly, the minimum shift, no overflow scaling case will soon be found and valid output data will result from that point onward.

32 POINT REAL FFTS

The TMC2032 performs a complex-to-complex Fourier transform. However, in many potential applications for this chip, real data only is being processed and 32 points of real data must be transformed into sixteen complex valued frequencies. Here, the TMC2032 may be used to compute two real-to-complex transforms in the same amount of time required to compute a single complex-to-complex transform. The following computational procedure applies (1):

1. Load the first 32 real valued data points into the real array of the TMC2032. Load the second 32 real valued data points into the imaginary array of the TMC2032.
2. Execute the 32 point complex-to-complex FFT macro instruction.

At this point, we must realize that the Fourier transform of real only data will have a real part that is an even function of frequency and an imaginary part that is an odd function of frequency. Correspondingly, the transform of imaginary only data will have an even imaginary part and an odd real part. Therefore, the two sets of sixteen complex frequencies may be generated by the simple additions and subtractions required to sort out the even and odd parts.

Using the above procedure, the effective processing bandwidth of the FFT chip may be doubled when processing real data by the addition of fairly slow add and subtract elements. Therefore, we can now transform real data with an input sample rate of up to 1.36 MHz.

BUILDING A 1024 POINT FFT

The TMC2032 was designed to be used as a building block for the construction of larger size transforms. A 1024 point FFT may be constructed using the following computational method (2):

1. First of all, we must take the 1024 input complex time samples and arrange them into a two dimensional matrix with the following format:

       0    1    2    3  . . .   31
      32   33   34   35  . . .   63
       .                          .
     992        . . .          1023

   Further, we will define M (M=0 through 31) as the column index and L (L=0 through 31) as the row index. Generation of the final 1024 point transform will now be performed by using 32 point FFT's on the rows and columns of this matrix, then reformatting back to 1024 complex frequencies.
2. Using the TMC2032, perform a 32 point FFT on each of the 32 columns.
3. Every element must now be multiplied by a complex twiddle factor depending on its location in the matrix. This factor is:

       $W^{ML}$

   where M and L are the column and row indices and W is:

       $W = e^{-j 2\pi / 1024}$

4. Using the TMC2032, compute the 32 point FFT's of each of the 32 rows.
5. We now have completed the 1024 point transform computation and, in the process, transposed the original matrix. Therefore, we must now read out our frequencies with F(0) being located at position (0,0), F(1) at (0,1), F(2) at (0,2), F(32) at (1,0) etc.

As can be seen, the above procedure requires the computation of 64 different 32 point FFT's as well as 1024 complex multiplies. The system designer has the flexibility of selecting combinations of parallel and serial structures which implement the required processing within his speed constraints. For example, maximum speed will be attained using 64 TMC2032's, and the minimum hardware system will use a single chip sequenced through all 64 FFT's.

A block diagram of one possible implementation of the 1024 point transform is given in Figure 2. Here, 16 TMC2032's are each sequenced through four FFT's to compute a single 1024 point transform. Complex multiplication is performed using two multiplier-accumulator chips. Each chip is sequenced twice to generate a single complex product. Finally, a total of three frame store memories are used to store intermediate results and read them out in row or column order as required. These memories are double buffered to allow sustained rate processing. This system is capable of producing a new 1024 point FFT every 188 microseconds, subject to a latency time of 752 microseconds.

REFERENCES

(1) L. D. Enochson and R. K. Otnes, "Digital Time Series Analysis", 1972.
(2) L. R. Rabiner and B. Gold, "Theory and Application of Digital Signal Processing", Prentice-Hall, 1975; pp. 371-379.

Figure 2.
1024 Point FFT Implementation
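The procedure given under 32 POINT REAL FFTS (two real sequences loaded into the real and imaginary arrays, one complex transform, then additions and subtractions to separate the even and odd parts) can be sketched in a few lines of Python. This is an illustrative floating point model, not a description of the external add/subtract hardware: the names `dft` and `two_real_ffts` are hypothetical, and a direct DFT stands in for the chip's 32 point complex FFT.

```python
import cmath

def dft(x):
    """Direct DFT, standing in here for the 32 point complex-to-complex FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def two_real_ffts(a, b):
    """Transform two real sequences with a single complex transform.

    a goes into the real array and b into the imaginary array. The spectrum
    of a is the conjugate-symmetric part of the result, and the spectrum of b
    is the conjugate-antisymmetric part divided by j.
    """
    n = len(a)
    z = dft([complex(p, q) for p, q in zip(a, b)])
    spec_a, spec_b = [], []
    for k in range(n):
        zk = z[k]
        zc = z[(n - k) % n].conjugate()
        spec_a.append((zk + zc) / 2)      # additions sort out the even/odd parts of a
        spec_b.append((zk - zc) / 2j)     # subtractions sort out those of b
    return spec_a, spec_b
```

Only bins 0 through 16 of each result are independent, because the spectrum of real data is conjugate symmetric. The quoted 1.36 MHz input rate is consistent with 64 real samples being consumed per 47.0 microsecond transform.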
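The five step method under BUILDING A 1024 POINT FFT can be checked numerically with the following Python sketch. It is a floating point illustration only: `fft_rows_cols` and `dft` are hypothetical names, a direct DFT replaces each TMC2032, and the twiddle factor is taken to be W raised to the product of the column index M and the row index L, with the forward transform sign convention W = exp(-j2π/1024) assumed.

```python
import cmath

def dft(x):
    """Direct DFT, standing in for one 32 point FFT on the chip."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def fft_rows_cols(x, rows=32, cols=32):
    """Compute a (rows * cols) point DFT from short column and row transforms."""
    n = rows * cols
    # Step 1: sample x[cols*L + M] sits at row L, column M.
    mat = [list(x[L * cols:(L + 1) * cols]) for L in range(rows)]
    # Step 2: transform each column.
    for M in range(cols):
        col = dft([mat[L][M] for L in range(rows)])
        for L in range(rows):
            mat[L][M] = col[L]
    # Step 3: multiply every element by the twiddle factor W^(M*L).
    for L in range(rows):
        for M in range(cols):
            mat[L][M] *= cmath.exp(-2j * cmath.pi * M * L / n)
    # Step 4: transform each row.
    mat = [dft(row) for row in mat]
    # Step 5: the result is transposed, so F(k) is read from
    # row (k mod rows), column (k div rows).
    return [mat[k % rows][k // rows] for k in range(n)]

# Usage: a unit impulse at sample 1 transforms to F(k) = W^k.
impulse = [0j] * 1024
impulse[1] = 1 + 0j
spectrum = fft_rows_cols(impulse)
```

With the default arguments the function performs 32 column transforms, 1024 complex multiplies and 32 row transforms, matching the operation count stated in the text. Sharing the 64 transforms among 16 devices at four transforms each, at 47.0 microseconds per transform, gives the quoted 188 microseconds per 1024 point result.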