# Low-Power, High-Throughput, and Low-Area Architecture for the FIR Filter Structure

AKILA ${ }^{1}$<br>PG SCHOLAR<br>Dr.A.RAJARAM ${ }^{2}$<br>ASSOCIATE PROFESSOR<br>DEPARTMENT OF ECE<br>KARPAGAM UNIVERSITY


#### Abstract

Based on fast finite-impulse response (FIR) algorithms (FFAs), this paper proposes new parallel FIR filter structures, which are beneficial to symmetric coefficients in terms of the hardware cost, under the condition that the number of taps is a multiple of 2 or 3. The proposed parallel FIR structures exploit the inherent nature of symmetric coefficients, halving the number of multipliers in the subfilter section at the expense of additional adders in the preprocessing and postprocessing blocks. Exchanging multipliers for adders is advantageous because adders cost less than multipliers in terms of silicon area; in addition, the overhead from the additional adders in the preprocessing and postprocessing blocks stays fixed and does not increase with the length of the FIR filter, whereas the number of saved multipliers increases with the length of the FIR filter. For example, for a four-parallel 72-tap filter, the proposed structure saves 27 multipliers at the expense of 11 adders, whereas for a four-parallel 576-tap filter, the proposed structure saves 216 multipliers, still at the expense of 11 adders. Overall, the proposed parallel FIR structures can lead to significant hardware savings for symmetric convolutions compared with the existing FFA parallel FIR filter, especially when the length of the filter is large. The design can be carried out using Cadence Virtuoso.


Index Terms: Adaptive filter, circuit optimization, distributed arithmetic (DA), least mean square (LMS) algorithm.

## INTRODUCTION

Adaptive filters are widely used in several digital signal processing applications. The tapped-delay line finite-impulse response (FIR) filter whose weights are updated by the famous Widrow-Hoff least mean square (LMS) algorithm is the most popularly used adaptive filter, not only due to its simplicity but also due to its satisfactory convergence performance [1]. The direct-form configuration on the forward path of the FIR filter results in a long critical path due to the inner-product computation needed to obtain a filter output. Therefore, when the input signal has a high sampling rate, it is necessary to reduce the critical path of the structure so that it does not exceed the sampling period. In recent years, the multiplier-less distributed arithmetic (DA)-based technique [2] has gained substantial popularity for its high-throughput processing capability and regularity, which result in cost-effective and area-time efficient computing structures. A hardware-efficient DA-based design of an adaptive filter has been suggested by Allred et al. [3] using two separate lookup tables (LUTs) for filtering and weight update. Guo and DeBrunner [4], [5] have improved the design in [3].

A finite impulse response (FIR) filter estimates the current state based on the measurements over the recent finite time horizon, while an infinite impulse response (IIR) filter such as the Kalman filter utilizes all measurements from the initial time to the current time. The continuous-time FIR filter for estimating the current state can be represented as (1) where and are the current time, the fixed horizon size, and the measurements, respectively, and is a kernel function obtained in an optimal way from a certain criterion. It has been known that the FIR filter (1) has some advantages such as robustness to temporary modeling uncertainties and numerical errors [1], [2]. In addition, the FIR filter has the deadbeat property, which means that the estimated state in (1) tracks a real trajectory exactly if noises do not appear on the horizon [2]. Because of such benefits, the FIR filter has been much investigated and extended even to nonlinear and hybrid systems [3], [4].

A fading memory method has been used to prevent the divergence of the Kalman IIR filter by employing exponential data weighting [5]-[8].

## RELATED WORK



With the explosive growth of multimedia applications, the demand for high-performance and low-power digital signal processing (DSP) is getting higher and higher. The FIR digital filter is one of the most widely used fundamental devices in DSP systems, ranging from wireless communications to video and image processing. Some applications, such as video processing, need the FIR filter to operate at high frequencies, whereas others, such as multiple-input-multiple-output systems used in cellular wireless communication, require high throughput with a low-power circuit. Furthermore, when narrow transition band characteristics are required, a much higher order in the FIR filter is unavoidable. In this brief, parallel processing in the digital FIR filter will be discussed. Because the hardware implementation cost increases linearly with the block size L, the parallel processing technique loses its advantage in practice.

There have been a few papers proposing a simple yet efficient low-power reconfigurable FIR filter architecture, where the filter order can be dynamically changed depending on the amplitude of both the filter coefficients and the inputs. In other words, when the data sample multiplied by the coefficient is so small that its effect on the partial sum in the FIR filter is negligible, the multiplication operation can simply be canceled. The filter performance degradation can be minimized by keeping the error bound as small as the quantization error or the signal-to-noise power ratio (SNR) of the given system. The primary goal of this work is to reduce the dynamic power of the FIR filter, and the main contributions are summarized as follows.

1) A new reconfigurable FIR filter architecture with real-time input and coefficient monitoring circuits is presented. Since the basic filter structure is not changed, it is applicable to FIR filters with programmable coefficients or to adaptive filters.
2) We provide a mathematical analysis of the power saving and filter performance degradation of the proposed approach. The analysis is verified using experimental results, and it can be used as a guideline to design low-power reconfigurable filters.

The rest of the paper is organized as follows. In Section II, the basic idea of the proposed reconfigurable filter is described. Section III presents the reconfigurable hardware architecture and circuit techniques used to implement the filter.
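The multiplication-canceling idea described in this section can be illustrated with a minimal Python sketch. It assumes one particular skipping rule, namely that a product is dropped only when both the input magnitude and the coefficient magnitude fall below thresholds; the function name `fir_output_skipping` and the thresholds `x_th` and `c_th` are hypothetical and are not taken from the cited work.

```python
def fir_output_skipping(coeffs, window, x_th, c_th):
    """Compute one FIR output sample, skipping negligible products.

    coeffs[k] is the k-th filter coefficient and window[k] holds x(n-k).
    Assumed rule: a multiplication is canceled when |x| < x_th and |c| < c_th.
    Returns (y, number_of_multiplications_actually_performed).
    """
    y = 0.0
    mults = 0
    for c, x in zip(coeffs, window):
        if abs(x) < x_th and abs(c) < c_th:
            # Product is bounded by x_th * c_th, so it is treated as zero.
            continue
        y += c * x
        mults += 1
    return y, mults


if __name__ == "__main__":
    coeffs = [0.5, 0.01, -0.4, 0.02]
    window = [1.0, 0.005, 0.003, 0.001]
    y, mults = fir_output_skipping(coeffs, window, x_th=0.01, c_th=0.05)
    print(y, mults)
```

Under this rule each skipped product contributes an error of at most `x_th * c_th`, so the thresholds give a direct handle on the error bound mentioned above.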

## FAST FIR ALGORITHM (FFA)

During each cycle, the LMS algorithm computes a filter output and an error value that is equal to the difference between the current filter output and the desired response. The estimated error is then used to update the filter weights in every training cycle. The weights of LMS adaptive filter during the $n$th iteration are updated according to the following equations:

$$
\mathbf{w}(n+1)=\mathbf{w}(n)+\mu \cdot e(n) \cdot \mathbf{x}(n)
$$
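The update above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the error is taken as $e(n)=d(n)-y(n)$ (desired response minus filter output), and the helper names `lms_step` and `lms_identify` are hypothetical.

```python
def lms_step(w, x, d, mu):
    """One LMS iteration; returns (updated weights, output y, error e)."""
    # Filter output: inner product of the weight vector and the input vector x(n).
    y = sum(wi * xi for wi, xi in zip(w, x))
    # Error between the desired response and the current filter output (assumed sign).
    e = d - y
    # Weight update: w(n+1) = w(n) + mu * e(n) * x(n).
    w_next = [wi + mu * e * xi for wi, xi in zip(w, x)]
    return w_next, y, e


def lms_identify(signal, desired, num_taps, mu):
    """Run LMS over a whole sequence using a tapped-delay line of num_taps samples."""
    w = [0.0] * num_taps
    delay_line = [0.0] * num_taps
    for x_n, d_n in zip(signal, desired):
        # Shift the newest sample into the delay line.
        delay_line = [x_n] + delay_line[:-1]
        w, _, _ = lms_step(w, delay_line, d_n, mu)
    return w
```

When `desired` is produced by a fixed FIR system driven by `signal` and the step size `mu` is small enough for convergence, the returned weights approach the coefficients of that system.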

Fig. 1. Carry-save implementation
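As a rough illustration of the carry-save accumulation that Fig. 1 refers to, the sketch below models a 3:2 compressor on Python integers. It is a behavioral model only, not the circuit of the proposed design; the names `carry_save_add` and `carry_save_accumulate` are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three integers to a (sum, carry) pair.

    No carry is propagated across bit positions; a + b + c == sum + carry.
    """
    partial_sum = a ^ b ^ c                      # bitwise sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1   # majority bits, shifted left
    return partial_sum, carry


def carry_save_accumulate(terms):
    """Accumulate signed terms in carry-save form, then resolve once at the end."""
    s, c = 0, 0
    for t in terms:
        s, c = carry_save_add(s, c, t)
    # A single carry-propagate addition produces the final result.
    return s + c
```

The point of the scheme is that only the final addition needs carry propagation, which keeps the per-term accumulation delay independent of the word length.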

Consider a linear continuous-time state space model described by where is the state vector, is the output vector, is the system noise vector, and is the measurement noise vector. The noise might represent a deterministic error in modeling the dynamic system and is also considered as an unknown input generating state trajectories. It is assumed that the pair is observable and the system matrix is Hurwitz. We consider the following optimization problem: with , , and given measurements over . The positive definite matrices and are to be regarded as weighting matrices, which are quantitative measures of our belief in the dynamic systems and the measurement model (3), respectively. In this letter, the matrices and will be used to quantify the relative weighting with time, which will be discussed later on. In lies on the end point of the optimal trajectory generated by . If in (4) is considered as the reference signals, the optimization problem (4) is almost the same as the linear quadratic tracking control problem with a zero boundary weighting matrix. The only difference from the control problem is that the minimization is taken over all possible trajectories without a fixed boundary state. In most control problems, the initial state on the horizon is given, which is not the case in (4). We would like to determine in (1) so that the cost function in (4) is minimized, and the errors and coming from the dynamic systems and the measurements.

conventional TGFF shows over 100 better error resiliency than TGFFs by the spallation neutron irradiation [7]. DTG

## REDUNDANT FIR STRUCTURE

Fig. 1 shows the schematic of clock freq. It operates with the single-phase clocking scheme using pass-transistors. Without using local clock buffers, power dissipation can be reduced. As data activity becomes low, total power dissipation is drastically reduced. However, PMOS pass-transistors are too weak to pass a substantially large drain current. It is difficult to overwrite the master latch because PMOS pass-transistors are located in front of the master latch. The two Adaptive-Coupled (AC) transistors make it easy to overwrite the master latch. When the next value is the same as the current value, the cross-coupled loop keeps the current value. When it is different, the AC makes the holding value weak. CLOCK FREQ uses two fewer transistors than the transmission-gate (TG) FF, as shown in the table.

To remove a SET pulse coming to master latches, a delay element such as in [12] can be used, as described by the dotted lines in Fig. 4. But this is vulnerable to a SET pulse produced by the C-element between the master and slave latches. There are two methods to remove a SET pulse coming to slave latches. The first method is to insert delay elements in front of the slave latches, but it makes the area and delay penalties much bigger. The second method is to duplicate C-elements; its area and delay penalties are smaller than those of the first method. A SET pulse from is only captured by . DR FF based on the

FFs (DTG-FFs) were used and able to achieve very low phase errors at much lower power consumption than CML. Explorative simulations in [10] confirmed that DTG-FFs have significant advantages over CML FFs (CML-FFs) for MPCG. However, we would like to understand under which conditions (frequency and number of phases) this is true and how technology affects the conclusions. Although the speed, power, and power delay have been analyzed extensively for several FF topologies (e.g., [11] and [12]), there is not much work on optimizing jitter-power performance. This brief hence derives analytical equations to estimate jitter, power, and FOM for both DTG-FFs and CML-FFs. Such analytical equations are valuable for insight and to guide the initial design of FFs.

Fig. 2. Layout of proposed structure

In the rate equation model, the following assumptions are considered.

1) Carrier concentration is assumed to be constant over the length of the cavity. This assumption allows the use of a simple SOA model [8] to model the laser gain section.
2) The effects of amplified spontaneous emission and residual facet reflectivities in [8] are ignored. These assumptions are valid for the typical gains required in the flip-flop and the quality of SOAs used in the experiments. In the experiments, the laser output power due to spontaneous emission was less than $4\%$ of the output power at the lasing wavelength. Furthermore, the SOAs used had residual facet reflectivities approximately 400 times less than the mirror reflectivities used to form the lasers.
3) We assume that the differences between the wavelengths involved are only a small fraction of the wavelength they are centered around. Hence, we can simplify expressions by taking the energy of photons, at slightly different wavelengths, to be the same and equal to
4) The light injected into the laser experiences the same gain, guiding, and internal loss as the light at the lasing wavelength inside the laser. Later, we will relax the assumption on the gain.

These assumptions lead to simple analytic results, and their accuracy is supported by experimental results.

## FIR FILTER MODELLING

We will now model the mismatch jitter and power consumption for an $N$-phase MPCG/divider implemented using DTG-FFs and CML-FFs as depicted in Figs. 1 and 2 for the case $N=4$. The differential divider outputs (e.g., pair $I+$, $I-$) will be analyzed, so that a fair comparison can be made with a CML-FF that has a differential output. To provide insight, we keep the equations simple and use first-order device equations rather than the more complicated short-channel models. Evaluating (1) for an MPCG with $N$ DTG-FFs, we find

$$
\mathrm{FOM}_{\text{DTG-MPCG}}=\sigma_{\text{DTG-FF}}^{2}\left(N \cdot P_{\text{DTG-FF}}+P_{\text{DTG-INBUF}}\right)
$$

where $\sigma_{\text{DTG-FF}}^{2}$ is the mismatch-jitter variance (variation in FF delay) and $N$ is the number of phases. $P_{\text{DTG-FF}}$ and $P_{\text{DTG-INBUF}}$ are the power consumptions of an FF and the input clock buffer, respectively. As we aim for insight in FF design (used in an MPCG), we chose to analyze "FOM per stability.
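The FOM expression for the DTG-based MPCG can be evaluated numerically as in the sketch below. The function name `fom_dtg_mpcg` and all numerical values are hypothetical placeholders chosen only to show how the terms combine; they are not measured or simulated results.

```python
def fom_dtg_mpcg(sigma_ff, n_phases, p_ff, p_inbuf):
    """Evaluate FOM = sigma_ff^2 * (N * P_FF + P_INBUF).

    sigma_ff : mismatch-jitter standard deviation of one FF, in seconds
    n_phases : number of phases N (one FF per phase)
    p_ff     : power of one FF, in watts
    p_inbuf  : power of the input clock buffer, in watts
    """
    return sigma_ff ** 2 * (n_phases * p_ff + p_inbuf)


if __name__ == "__main__":
    # Hypothetical values: 1 ps jitter, N = 4, 1 mW per FF, 0.5 mW input buffer.
    print(fom_dtg_mpcg(1e-12, 4, 1e-3, 0.5e-3))
```

Because the jitter enters as a variance, doubling `sigma_ff` multiplies the FOM by four, whereas the power terms enter linearly.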

## STEADY-STATE BEHAVIOR AND STABILITY

The steady-state solutions of the rate equations which represent stable states can be found by linearizing the flip-flop system [10] at a particular steady-state solution denoted by . In a small neighborhood around, the flip-flop, which is a nonlinear system, behaves like a linear system. The linearized system can be checked for stability using standard techniques [10] to determine the flip-flop stability for . However, this formal approach does not yield simple analytic expressions for crucial flip-flop properties in terms of the flip-flop parameters. Nor does it offer insight into how the flip-flop operates. To obtain simple analytic expressions and more insight into the flip-flop behavior, we employ a simplified model. In the model, the lasers are represented by their steady-state photon number versus injected photon number curves (see Figs. 2 and 3). Also in the simplified model, the light output of the laser changes instantaneously in accordance with the input light. A time delay of $s$ is experienced by light travelling between the lasers. In this simplified model, there are just two state variables: $S_1$ and $S_2$.


Fig. 3. Clocking style with transistors

In addition, only one transistor being active during the transition increases the driving capability of the output stage and prevents the crow-bar current, reducing the power dissipation. Both true and complementary outputs have the same driving strengths, which is important not only for differential logic styles, but could also effectively double the driving capability of the flip-flop even when used with standard CMOS design. The self-loading at the output of the second stage is reduced as compared to a NAND implementation. The loading at the output with the NAND cross-coupled latch is two large gate capacitances and three large drain capacitances. The new output stage has a loading of two small gates and two large and two small junctions. The proposed SAFF, shown in Fig. 5, has all the advantages of earlier published SAFFs. It allows integration of the logic into the flip-flop, as well as reduced clock-swing iteration [10]. The single-ended input version with multiplexed data scan and asynchronous reset is also possible.

The modulation curve of the reset state was shifted from the curve of the set state along the horizontal control current axis. This feature is shown in the figure. Since the read SQUID voltage difference between the two states had maximum values when the control current was set to 0.7 mA, we operated the RS flip-flop under the following conditions: $I_{\ldots}=0.455\ \mathrm{mA}$, $I_{\ldots}=0.7\ \mathrm{mA}$, set pulse $=0.286\ \mathrm{mA}$, reset pulse $=0.381\ \mathrm{mA}$. The output voltage difference between the set state and the reset state is clearly distinguished in the figure. The voltage difference between the two states was about 2 pV. The $V$-$\Phi$ modulation curves of the read SQUID fabricated on a ground plane are shown in Fig. 9. The figure shows both $V$-$\Phi$ modulation curves, of the set state and of the reset state. The measurements were made at 57 K with the bias current $I_b$ set to 0.165 mA. Unlike the behavior of the read SQUID in an RS flip-flop without a ground plane, the modulation curve of the reset state was shifted from the curve of the set state along the vertical voltage axis. The vertical shift of the modulation curve may be attributed to the flux shielding effect caused by the underlying ground plane.

## REDUCED CLOCK-SWING FLIP-FLOP FOR FIR FILTER

A reduced clock-swing flip-flop (RCSFF) is proposed to lower the voltage swing of the clock system. Fig. 2 shows schematic diagrams of the conventional flip-flop and the proposed RCSFF. With the conventional flip-flop, the clock swing cannot be reduced because and are required, and overhead becomes imminent if two clock lines and are to be distributed. On the other hand, if only is distributed, most of the clock-related MOSFETs operate at full swing, and only a minor power improvement is expected.

## RESULTS OF PROPOSED SCHEME

The error resilience of the FFs on the fabricated chip is measured using $\alpha$-particles from 3 MBq and neutron radiation at RCNP (Research Center for Nuclear Physics) of Osaka University [17]. Fig. 12 shows the neutron beam spectrum compared with the terrestrial neutron spectrum at the ground level of Tokyo. The average accelerated factor is in this measurement. In this work, all FFs are initialized to 0. For the $\alpha$-particle irradiation, the clock frequency is 0, 100 MHz, 300 MHz, 800 MHz, and 1 GHz. When the clock frequency is 0 Hz, we measured two patterns ( or 1). Flipped values are obtained every 5 min.

shows this advantage for a wide output frequency range. In this case, the simulation was done for a load capacitance of 10 fF and $r_l=1$. When we change the number of phases, we can keep either the input frequency or the output frequency constant. From (23), FOM ratios for both scenarios are plotted in Fig. 6(b) for $f_o=100\ \mathrm{MHz}$ and $f_i=4\ \mathrm{GHz}$.

Fig. 4. DC response

DTG-FF performs better (ratio > 1). In Fig. 6(c), we compare the simulated FOM for changing FF sizes, with fixed input (at INCLK+ and INCLK- in Figs. 1 and 4) and output capacitances, and it also shows an order of magnitude better than the FOM for DTG. In this case, extra buffers have been added in the clock path when larger FF devices are used. Although the CML-FF FOM is more robust to temperature ($\sim 5\%$ for $-10\,^{\circ}\mathrm{C}$ to $85\,^{\circ}\mathrm{C}$) and process variations ($\sim 15\%$) than the DTG-FF ($\sim 10\%$ and $55\%$, respectively), a big advantage remains. Therefore, for low power and jitter performance, DTG logic is preferred for wideband operation, e.g., for flexible software-defined radio applications. This is because its power and FOM are automatically reduced for lower frequency [first term in (13)], whereas CML always dissipates the current that is required at the highest frequency of operation.

Fig. 5. Transient analysis

TABLE I:

| Design | Throughput (µs) | Area (mm²) | Power (mW) |
| :---: | :---: | :---: | :---: |
| Existing [1] | 312 | 24.9 | 12 |
| Existing [2] | 357 | 20.3 | 24 |
| Existing [3] | 50 | 18.2 | 9 |
| Proposed | 330 | 16 | 7.3 |
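To make the comparison in Table I easier to read, the sketch below computes the relative area and power reductions of the proposed design against each existing design. It only post-processes the numbers already listed in the table; the helper names `percent_reduction` and `compare_to_proposed` are hypothetical.

```python
# Area (mm^2) and power (mW) values copied from Table I.
TABLE_I = {
    "Existing [1]": {"area_mm2": 24.9, "power_mw": 12.0},
    "Existing [2]": {"area_mm2": 20.3, "power_mw": 24.0},
    "Existing [3]": {"area_mm2": 18.2, "power_mw": 9.0},
    "Proposed": {"area_mm2": 16.0, "power_mw": 7.3},
}


def percent_reduction(baseline, proposed):
    """Reduction of `proposed` relative to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


def compare_to_proposed(table, key="Proposed"):
    """Return per-design, per-metric percentage reductions achieved by `key`."""
    ref = table[key]
    return {
        name: {metric: percent_reduction(row[metric], ref[metric]) for metric in row}
        for name, row in table.items()
        if name != key
    }


if __name__ == "__main__":
    for name, metrics in compare_to_proposed(TABLE_I).items():
        print(name, {m: round(v, 1) for m, v in metrics.items()})
```

The throughput column is left out of this comparison because the table does not make clear whether a larger or a smaller value is better.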

## CONCLUSIONS

We have suggested an efficient pipelined architecture for low-power, high-throughput, and low-area implementation of an adaptive filter. The throughput rate is significantly enhanced by parallel LUT update and concurrent processing of the filtering operation and the weight-update operation. We have also proposed a carry-save accumulation scheme of signed partial inner products for the computation of the filter output. From the synthesis results, we find that the proposed design consumes $10\%$ less power and $25\%$ less ADP than our previous FIR adaptive filter, on average across filter lengths. Compared to the best of the other existing designs, our proposed architecture consumes less power and has 4.6 times less ADP. Offset binary coding is popularly used to reduce the LUT size to half for area-efficient implementation of DA [2], [5], and it can be applied to our design as well. This can be done using the Cadence Virtuoso tool.

## REFERENCES

[1] D. Krueger, E. Francom, and J. Langsdorf, "Circuit design for voltage scaling and SER immunity on a quad-core Itanium processor," in Proc. ISSCC, Feb. 2008, pp. 94-95.
[2] M. Zhang, S. Mitra, T. M. Mak, N. Seifert, N. J. Wang, Q. Shi, K. S. Kim, N. R. Shanbhag, and S. J. Patel, "Sequential element design with built-in soft error resilience," IEEE Trans. VLSI Syst., vol. 14, no. 12, pp. 1368-1378, Dec. 2006.
[3] B. I. Matush, T. J. Mozdzen, L. T. Clark, and J. E. Knudsen, "Area-efficient temporally hardened by design flip-flop circuits," IEEE Trans. Nucl. Sci., vol. 57, no. 6, pp. 3588-3595, Dec. 2010.
[4] F. Klass, C. Amir, A. Das, and K. Aingaran, "A new family of semidynamic and dynamic flip-flops with embedded logic for high-performance processors," IEEE J. Solid-State Circuits, vol. 34, no. 5, May 1999.
[5] J. Yuan and C. Svensson, "High-speed CMOS circuit technique," IEEE J. Solid-State Circuits, vol. 24, pp. 62-70, Feb. 1989.
[6] Y. Ji-Ren, I. Karlsson, and C. Svensson, "A true single-phase-clock dynamic CMOS circuit technique," IEEE J. Solid-State Circuits, vol. SC-22, pp. 899-901, Oct. 1987.
[7] J. Furuta, C. Hamanaka, K. Kobayashi, and H. Onodera, "A 65 nm bistable cross-coupled dual modular redundancy flip-flop capable of protecting soft errors on the C-element," in Proc. VLSI Circuits Symp., Jun. 2010, pp. 123-124.
[8] K. T. Chen, T. Fujita, H. Hara, and M. Hamada, "A 77% energy-saving 22-transistor single-phase-clocking D-flip-flop with adaptive-coupling configuration in 40 nm CMOS," in Proc. ISSCC, Feb. 2011, pp. 338-340.
[9] R. Yamamoto, C. Hamanaka, J. Furuta, K. Kobayashi, and H. Onodera, "An area-efficient 65 nm radiation-hard dual-modular flip-flop to avoid multiple cell upsets," IEEE Trans. Nucl. Sci., vol. 58, no. 6, pp. 3053-3059, Dec. 2011.
[10] M. J. Gadlage, J. R. Ahlbin, B. Narasimham, B. L. Bhuva, L. W. Massengill, R. A. Reed, R. D. Schrimpf, and G. Vizkelethy, "Scaling trends in SET pulse widths in sub-100 nm bulk CMOS processes," IEEE Trans. Nucl. Sci., vol. 57, no. 6, pp. 3336-3341, Dec. 2010.
[11] V. K. Kaplunenko, "Fluxon interaction in an overdamped Josephson transmission line," Appl. Phys. Lett., vol. 66, pp. 3365-3367, June 1995.
[12] S. Shambhulingaiah, L. T. Clark, T. J. Mozdzen, N. D. Hindman, S. Chellappa, and K. E. Holbert, "Temporal sequential logic hardening by design with a low power delay element," in Proc. RADECS, Sep. 2011, vol. B-6, pp. 144-149.
[13] T. Uemura, Y. Tosaka, H. Matsuyama, K. Shono, C. J. Uchibori, K. Takahisa, M. Fukuda, and K. Hatanaka, "SEILA: Soft error immune latch for mitigating multi-node-SEU and local-clock-SET," in Proc. IRPS, May 2010, pp. 218-223.
[14] J. E. Knudsen and L. T. Clark, "An area and power efficient radiation hardened by design flip-flop," IEEE Trans. Nucl. Sci., vol. 53, no. 6, pp. 3053-3059, Dec. 2006.
[15] G. Toure, G. Hubert, K. Castellani-Coulie, S. Duzellier, and J. Portal, "Simulation of single and multinode collection: Impact on SEU occurrence in nanometric SRAM cells," IEEE Trans. Nucl. Sci., vol. 58, no. 3, pp. 862-869, Jun. 2011.
[16] P. Dodd, A. Shaneyfelt, K. Horn, D. Walsh, G. Hash, T. Hill, B. Draper, J. Schwank, F. Sexton, and P. Winokur, "SEU-sensitive volumes in bulk and SOI SRAMs from first-principles," IEEE Trans. Nucl. Sci., vol. 48, no. 6, pp. 1893-1903, Dec. 2001.
[17] C. W. Slayman, "Theoretical correlation of broad spectrum neutron sources for accelerated soft error testing," IEEE Trans. Nucl. Sci., vol. 57, no. 6, pp. 3163-3168, Dec. 2010.
[18] G. Gasiot, M. Glorieux, S. Uznanski, S. Clerc, and P. Roche, "Experimental characterization of process corners effect on SRAM alpha and neutron soft error rates," in Proc. IRPS, 2012, pp. 3C.4.1-3C.4.5.
[19] K. M. Warren, A. L. Sternberg, J. D. Black, R. A. Weller, R. A. Reed, M. H. Mendenhall, R. D. Schrimpf, and L. W. Massengill, "Heavy ion testing and single event upset rate prediction considerations for a DICE flip-flop," IEEE Trans. Nucl. Sci., vol. 56, no. 6, pp. 3130-3137, Dec. 2009.
[20] H. Nakamura, K. Tanaka, T. Uemura, K. Takeuchi, T. Fukuda, and S. Kumashiro, "Measurement of neutron-induced single event transient pulse width narrower than 100 ps," in Proc. IRPS, May 2010, pp. 694-697.
[21] J. Furuta, C. Hamanaka, K. Kobayashi, and H. Onodera, "Measurement of neutron-induced SET pulse width using propagation-induced pulse shrinking," in Proc. IRPS, Apr. 2011, pp. 5B.2.1-5B.2.5.

