
2.6.2   Adders

We can view addition in terms of generate, G[i], and propagate, P[i], signals.

  method 1                          method 2

  G[i] = A[i] · B[i]                G[i] = A[i] · B[i]              (2.42)

  P[i] = A[i] ⊕ B[i]                P[i] = A[i] + B[i]              (2.43)

  C[i] = G[i] + P[i] · C[i - 1]     C[i] = G[i] + P[i] · C[i - 1]   (2.44)

  S[i] = P[i] ⊕ C[i - 1]            S[i] = A[i] ⊕ B[i] ⊕ C[i - 1]   (2.45)

where C[i] is the carry-out signal from stage i, equal to the carry in of stage (i + 1). Thus, C[i] = COUT[i] = CIN[i + 1]. We need to be careful because C[0] might represent either the carry in or the carry out of the LSB stage. For an adder we set the carry in to the first stage (stage zero), C[-1] or CIN[0], to '0'. Some people use delete (D) or kill (K) in various ways for the complements of G[i] and P[i], but unfortunately others use C for COUT and D for CIN, so I avoid using any of these. Do not confuse the two different methods (both of which are used) in Eqs. 2.42-2.45 when forming the sum, since the propagate signal, P[i], is different for each method.
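The two methods of Eqs. 2.42-2.45 can be compared with a short bit-level sketch. The Python below is only an illustration: the names `ripple_add`, `to_bits`, and `from_bits` and the LSB-first bit lists are assumptions made for this example, and the carry in to stage zero is taken to be '0' as described above.

```python
def ripple_add(a_bits, b_bits, method=1):
    """Add two equal-length bit lists (LSB first) using Eqs. 2.42-2.45.

    method=1 uses P = A XOR B; method=2 uses P = A OR B.
    Returns (sum_bits, carry_out).
    """
    carry = 0  # carry in to stage zero, set to '0'
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        g = a & b                                 # Eq. 2.42
        p = (a ^ b) if method == 1 else (a | b)   # Eq. 2.43
        if method == 1:
            s = p ^ carry                         # Eq. 2.45, method 1
        else:
            s = a ^ b ^ carry                     # Eq. 2.45, method 2
        carry = g | (p & carry)                   # Eq. 2.44
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(value, width):
    """Hypothetical helper: integer to LSB-first bit list."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Hypothetical helper: LSB-first bit list to integer."""
    return sum(bit << i for i, bit in enumerate(bits))


# 11 + 6 = 17, so a 4-bit adder gives sum 1 with carry out 1 for both methods.
for method in (1, 2):
    s, cout = ripple_add(to_bits(11, 4), to_bits(6, 4), method)
    print(method, from_bits(s), cout)
```

Note that the sum must be formed differently in each method: using the method 2 propagate (A OR B) in the method 1 sum equation would give the wrong result whenever both input bits are '1'.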

Figure 2.22(a) shows a conventional RCA. The delay of an n-bit RCA is proportional to n and is limited by the propagation of the carry signal through all of the stages. We can reduce delay by using pairs of "go-faster" bubbles to change AND and OR gates to fast two-input NAND gates as shown in Figure 2.22(a). Alternatively, we can write the equations for the carry signal in two different ways:

  either  C[i] = A[i] · B[i] + P[i] · C[i - 1]   (2.46)

  or  C[i] = (A[i] + B[i]) · (P[i]' + C[i - 1]),   (2.47)

where P[i]' = NOT(P[i]). Equations 2.46 and 2.47 allow us to build the carry chain from two-input NAND gates, one per cell, using different logic in even and odd stages (Figure 2.22b):

  even stages                               odd stages

  C1[i]' = P[i] · C3[i - 1] · C4[i - 1]     C3[i]' = P[i] · C1[i - 1] · C2[i - 1]   (2.48)

  C2[i] = A[i] + B[i]                       C4[i]' = A[i] · B[i]                    (2.49)

  C[i] = C1[i] · C2[i]                      C[i] = C3[i]' + C4[i]'                  (2.50)

 

(the carry inputs to stage zero are C3[-1] = C4[-1] = '0'). We can use the RCA of Figure 2.22(b) in a datapath, with standard cells, or on a gate array.

FIGURE 2.22  The ripple-carry adder (RCA). (a) A conventional RCA. The delay may be reduced slightly by adding pairs of bubbles as shown to use two-input NAND gates. (b) An alternative RCA circuit topology using different cells for odd and even stages and an extra connection between cells. The carry chain is a fast string of NAND gates (shown in bold).
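The two forms of the carry equation, Eqs. 2.46 and 2.47, can be checked against each other by enumerating the eight input combinations of one stage. This is a minimal sketch, assuming the method 1 propagate signal (P = A XOR B); the function names are hypothetical.

```python
from itertools import product


def carry_eq_2_46(a, b, c_prev):
    """Carry out written as in Eq. 2.46."""
    p = a ^ b                       # method 1 propagate
    return (a & b) | (p & c_prev)


def carry_eq_2_47(a, b, c_prev):
    """Carry out written as in Eq. 2.47."""
    p = a ^ b
    not_p = 1 - p                   # P' = NOT(P)
    return (a | b) & (not_p | c_prev)


# Collect any input case where the two forms (or the majority function) disagree.
mismatches = [
    (a, b, c)
    for a, b, c in product((0, 1), repeat=3)
    if not (carry_eq_2_46(a, b, c) == carry_eq_2_47(a, b, c) == int(a + b + c >= 2))
]
print(mismatches)  # an empty list means the two forms agree
```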

Instead of propagating the carries through each stage of an RCA, Figure 2.23 shows a different approach. A carry-save adder (CSA) cell CSA(A1[i], A2[i], A3[i], CIN, S1[i], S2[i], COUT) has three outputs:

  S1[i] = CIN   (2.51)

  S2[i] = A1[i] ⊕ A2[i] ⊕ A3[i] = PARITY(A1[i], A2[i], A3[i])   (2.52)

  COUT = A1[i] · A2[i] + [(A1[i] + A2[i]) · A3[i]] = MAJ(A1[i], A2[i], A3[i])   (2.53)

The inputs, A1, A2, and A3, and outputs, S1 and S2, are buses. The input, CIN, is the carry from stage (i - 1). The carry in, CIN, is connected directly to the output bus S1, as indicated by the schematic symbol (Figure 2.23a). We connect CIN[0] to VSS. The output, COUT, is the carry out to stage (i + 1).

A 4-bit CSA is shown in Figure 2.23(b). The arithmetic overflow signal for ones' complement or two's complement arithmetic, OV, is XOR(COUT[MSB], COUT[MSB - 1]) as shown in Figure 2.23(c). In a CSA the carries are "saved" at each stage and shifted left onto the bus S1. There is thus no carry propagation and the delay of a CSA is constant. At the output of a CSA we still need to add the S1 bus (all the saved carries) and the S2 bus (all the sums) to get an n-bit result using a final stage that is not shown in Figure 2.23(c). We might regard the n-bit sum as being encoded in the two buses, S1 and S2, in the form of the parity and majority functions.

We can use a CSA to add multiple inputs; as an example, an adder with four 4-bit inputs is shown in Figure 2.23(d). The last stage sums two input buses using a carry-propagate adder (CPA). We have used an RCA as the CPA in Figure 2.23(d) and (e), but we can use any type of adder. Notice in Figure 2.23(e) how the two CSA cells and the RCA cell abut together horizontally to form a bit slice (or slice) and then the slices are stacked vertically to form the datapath.
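The four-input adder can be sketched at the word level as two carry-save stages followed by a carry-propagate addition. In the Python below the names `csa` and `add4` are hypothetical, whole buses are modeled as integers, and Python's `+` stands in for the final CPA; it illustrates Eqs. 2.51-2.53 and is not a description of the layout in Figure 2.23.

```python
def csa(a1, a2, a3):
    """One carry-save stage applied to whole buses held as integers.

    Returns (s1, s2): s2 is the bitwise parity (Eq. 2.52) and s1 holds the
    majority bits (Eq. 2.53) shifted left by one place, because the carry out
    of stage i is saved onto bit (i + 1) of the S1 bus (Eq. 2.51).
    """
    s2 = a1 ^ a2 ^ a3
    cout = (a1 & a2) | ((a1 | a2) & a3)
    s1 = cout << 1
    return s1, s2


def add4(a, b, c, d):
    """Add four inputs using two CSA stages and a final CPA."""
    s1, s2 = csa(a, b, c)        # first CSA: three inputs become two buses
    t1, t2 = csa(s1, s2, d)      # second CSA folds in the fourth input
    return t1 + t2               # final CPA; Python's + stands in for it


print(add4(9, 7, 15, 3))  # 34
```

No carry ripples inside `csa`: every output bit depends only on the three input bits in the same position, which is why the delay of each CSA stage does not grow with the word length.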

 

FIGURE 2.23  The carry-save adder (CSA). (a) A CSA cell. (b) A 4-bit CSA. (c) Symbol for a CSA. (d) A four-input CSA. (e) The datapath for a four-input, 4-bit adder using CSAs with a ripple-carry adder (RCA) as the final stage. (f) A pipelined adder. (g) The datapath for the pipelined version showing the pipeline registers as well as the clock control lines that use m2.

We can register the CSA stages by adding vectors of flip-flops as shown in Figure 2.23(f). This reduces the adder delay to that of the slowest adder stage, usually the CPA. By using registers between stages of combinational logic we use pipelining to increase the speed and pay a price of increased area (for the registers) and introduce latency. It takes a few clock cycles (the latency, equal to n clock cycles for an n-stage pipeline) to fill the pipeline, but once it is filled, the answers emerge every clock cycle. Ferris wheels work much the same way. When the fair opens it takes a while (latency) to fill the wheel, but once it is full the people can get on and off every few seconds. (We can also pipeline the RCA of Figure 2.20. We add i registers on the A and B inputs before ADD[i] and add (n - i) registers after the output S[i], with a single register before each C[i].)
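The effect of latency can be seen in a small clock-by-clock simulation. This is a sketch under stated assumptions: a three-stage pipeline (CSA, CSA, CPA) with one register after each stage, hypothetical names `csa` and `pipelined_add4`, and `None` marking a register that does not yet hold valid data.

```python
def csa(a1, a2, a3):
    """Carry-save stage on integers: returns (saved carries, parity)."""
    s2 = a1 ^ a2 ^ a3
    s1 = ((a1 & a2) | ((a1 | a2) & a3)) << 1
    return s1, s2


def pipelined_add4(samples):
    """Simulate a three-stage pipelined four-input adder.

    `samples` is a list of (a, b, c, d) tuples, one presented per clock.
    Returns the contents of the output register after each clock edge.
    """
    reg1 = reg2 = reg3 = None    # pipeline registers, empty at the start
    outputs = []
    for cycle in range(len(samples) + 2):   # two extra clocks drain the pipe
        new = samples[cycle] if cycle < len(samples) else None
        # All registers load on the same clock edge, so update back to front.
        reg3 = None if reg2 is None else reg2[0] + reg2[1]               # CPA
        reg2 = None if reg1 is None else csa(reg1[0], reg1[1], reg1[2])  # CSA 2
        reg1 = None if new is None else (*csa(new[0], new[1], new[2]), new[3])
        outputs.append(reg3)
    return outputs


print(pipelined_add4([(1, 2, 3, 4), (5, 6, 7, 8)]))  # [None, None, 10, 26]
```

The first sum appears after the third clock edge (the latency of a three-stage pipeline), and the second follows one clock later.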

The problem with an RCA is that every stage has to wait to make its carry decision, C[i], until the previous stage has calculated C[i - 1]. If we examine the propagate signals we can bypass this critical path. Thus, for example, to bypass the carries for bits 4-7 (stages 5-8) of an adder we can compute BYPASS = P[4].P[5].P[6].P[7] and then use a MUX as follows:

  C[7] = (G[7] + P[7] · C[6]) · BYPASS' + C[3] · BYPASS.   (2.54)

Adders based on this principle are called carry-bypass adders (CBA) [Sato et al., 1992]. Large, custom adders employ Manchester-carry chains to compute the carries and the bypass operation using TGs or just pass transistors [Weste and Eshraghian, 1993, pp. 530-531]. These types of carry chains may be part of a predesigned ASIC adder cell, but are not used by ASIC designers.
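Equation 2.54 can be checked exhaustively for an 8-bit adder by comparing the bypassed C[7] with the carry obtained by rippling through every stage. The sketch below assumes the method 1 propagate signal (P = A XOR B) and a carry in of '0'; the function names are hypothetical.

```python
def ripple_carries(a, b, width=8):
    """Carries C[0]..C[width - 1] from Eq. 2.44 with the carry in at '0'."""
    carry, out = 0, []
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry = (ai & bi) | ((ai ^ bi) & carry)
        out.append(carry)
    return out


def c7_bypass(a, b):
    """C[7] formed with the bypass MUX of Eq. 2.54."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(8)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(8)]
    c = ripple_carries(a, b)
    bypass = p[4] & p[5] & p[6] & p[7]
    return ((g[7] | (p[7] & c[6])) & (1 - bypass)) | (c[3] & bypass)


# Compare against the rippled carry for every pair of 8-bit operands.
ok = all(c7_bypass(a, b) == ripple_carries(a, b)[7]
         for a in range(256) for b in range(256))
print(ok)  # True
```

In software the bypass saves nothing, of course; the point is only that the MUX output is logically identical to the rippled carry, so in hardware the faster of the two paths can be used.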

Instead of checking the propagate signals we can check the inputs. For example we can compute SKIP = (A[i - 1] ⊕ B[i - 1]) · (A[i] ⊕ B[i]) and then use a 2:1 MUX to select C[i]. Thus,

  CSKIP[i] = (G[i] + P[i] · C[i - 1]) · SKIP' + C[i - 2] · SKIP.   (2.55)

This is a carry-skip adder [Keutzer, Malik, and Saldanha, 1991; Lehman, 1961]. Carry-bypass and carry-skip adders may include redundant logic (since the carry is computed in two different ways; we just take the first signal to arrive). We must be careful that the redundant logic is not optimized away during logic synthesis.
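The redundancy in Eq. 2.55 can be made concrete with a short check that the skip MUX and the ordinary ripple carry always agree. This sketch takes SKIP to be '1' only when both stage (i - 1) and stage i propagate, which is an assumption of the example, and it uses hypothetical function names and a carry in of '0'.

```python
def ripple_carries(a, b, width):
    """Carries C[0]..C[width - 1] from Eq. 2.44 with the carry in at '0'."""
    carry, out = 0, []
    for k in range(width):
        ak, bk = (a >> k) & 1, (b >> k) & 1
        carry = (ak & bk) | ((ak ^ bk) & carry)
        out.append(carry)
    return out


def cskip(a, b, i, width):
    """CSKIP[i] from Eq. 2.55 for 1 <= i < width."""
    c = ripple_carries(a, b, width)

    def carry_at(k):
        # An index below zero refers to the adder carry in, which is '0'.
        return c[k] if k >= 0 else 0

    p = [((a >> k) & 1) ^ ((b >> k) & 1) for k in range(width)]
    g = [((a >> k) & 1) & ((b >> k) & 1) for k in range(width)]
    skip = p[i - 1] & p[i]      # assumed: skip only if both stages propagate
    normal = g[i] | (p[i] & carry_at(i - 1))
    return (normal & (1 - skip)) | (carry_at(i - 2) & skip)


width = 5
same = all(cskip(a, b, i, width) == ripple_carries(a, b, width)[i]
           for a in range(2 ** width)
           for b in range(2 ** width)
           for i in range(1, width))
print(same)  # True
```

Because both MUX inputs produce the same logical value, a synthesis tool that only looks at the logic function may remove one of them, which is the hazard noted above.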

If we evaluate Eq. 2.44 recursively for i = 1, we get the following:

  C[1] = G[1] + P[1] · C[0] = G[1] + P[1] · (G[0] + P[0] · C[-1])

       = G[1] + P[1] · G[0].   (2.56)

This result means that we can "look ahead" by two stages and calculate the carry into the third stage (bit 2), which is C[1], using only the first-stage inputs (to calculate G[0]) and the second-stage inputs. This is a carry-lookahead adder (CLA) [MacSorley, 1961]. If we continue expanding Eq. 2.44, we find:

  C[2] = G[2] + P[2] · G[1] + P[2] · P[1] · G[0],

  C[3] = G[3] + P[3] · G[2] + P[3] · P[2] · G[1] + P[3] · P[2] · P[1] · G[0].   (2.57)
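The lookahead terms can be checked against the recursive form of Eq. 2.44 for a 4-bit adder. The sketch below writes out each carry as the sum-of-products obtained by expanding Eq. 2.44 with the carry in to stage zero at '0'; the function names are hypothetical and the method 1 propagate signal is assumed.

```python
def gp(a, b, width=4):
    """Generate and propagate lists (method 1) for two integers."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    return g, p


def ripple_carries(a, b):
    """C[0]..C[3] computed one stage at a time from Eq. 2.44."""
    g, p = gp(a, b)
    carry, out = 0, []
    for i in range(4):
        carry = g[i] | (p[i] & carry)
        out.append(carry)
    return out


def lookahead_carries(a, b):
    """C[0]..C[3] written out as expanded lookahead terms."""
    g, p = gp(a, b)
    c0 = g[0]
    c1 = g[1] | (p[1] & g[0])
    c2 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
    c3 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]))
    return [c0, c1, c2, c3]


agree = all(lookahead_carries(a, b) == ripple_carries(a, b)
            for a in range(16) for b in range(16))
print(agree)  # True
```

Each additional stage adds one more product term with one more factor, which is the growth in complexity that limits how far ahead it is practical to look.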

As we look ahead further these equations become more complex, take longer to calculate, and the logic becomes less regular when implemented using cells with a limited number of inputs. Datapath layout must fit in a bit slice, so the physical and logical structure of each bit must be similar. In a standard cell or gate array we are not so concerned about a regular physical structure, but a regular logical structure simplifies design. The Brent-Kung adder reduces the delay and increases the regularity of the carry-lookahead scheme [Brent and Kung, 1982]. Figure 2.24(a) shows a regular 4-bit CLA, using the carry-lookahead generator cell (CLG) shown in Figure 2.24(b).

 

FIGURE 2.24  The Brent-Kung carry-lookahead adder (CLA). (a) Carry generation in a 4-bit CLA. (b) A cell to generate the lookahead terms, C[0]-C[3]. (c) Cells L1, L2, and L3 are rearranged into a tree that has less delay. Cell L4 is added to calculate C[2] that is lost in the translation. (d) and (e) Simplified representations of parts a and c. (f) The lookahead logic for an 8-bit adder. The inputs, 0-7, are the propagate and carry terms formed from the inputs to the adder. (g) An 8-bit Brent-Kung CLA. The outputs of the lookahead logic are the carry bits that (together with the inputs) form the sum. One advantage of this adder is that delays from the inputs to the outputs are more nearly equal than in other adders. This tends to reduce the number of unwanted and unnecessary switching events and thus reduces power dissipation.

In a carry-select adder we duplicate two small adders (usually 4-bit or 8-bit adders, often CLAs) for the cases CIN = '0' and CIN = '1' and then use a MUX to select the case that we need: wasteful, but fast [Bedrij, 1962]. A carry-select adder is often used as the fast adder in a datapath library because its layout is regular.
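The carry-select idea can be sketched behaviorally as follows. The names `block_add` and `carry_select_add` and the 4-bit block widths are assumptions made for this example, and Python's `+` stands in for whatever small adder (such as a CLA) is used inside each block.

```python
def block_add(a, b, cin, width):
    """Hypothetical small adder block; returns (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, block_widths=(4, 4, 4)):
    """Add two integers block by block, LSB block first."""
    result, carry, shift = 0, 0, 0
    for w in block_widths:
        mask = (1 << w) - 1
        a_blk, b_blk = (a >> shift) & mask, (b >> shift) & mask
        # Both cases are computed without waiting for the incoming carry.
        sum0, cout0 = block_add(a_blk, b_blk, 0, w)
        sum1, cout1 = block_add(a_blk, b_blk, 1, w)
        # The 2:1 MUX picks the case that matches the actual carry.
        s, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= s << shift
        shift += w
    return result, carry


# 0xABC + 0x7FF = 0x12BB: a 12-bit sum of 0x2BB (699) with carry out 1.
print(carry_select_add(0xABC, 0x7FF))  # (699, 1)
```

In hardware all the blocks compute their two candidate results at the same time, so only the chain of MUX selections remains on the critical path.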

We can use the carry-select, carry-bypass, and carry-skip architectures to split a 12-bit adder, for example, into three blocks. The delay of the adder is then partly dependent on the delays of the MUX between each block. Suppose the delay due to 1 bit in an adder block (we shall call this a bit delay) is approximately equal to the MUX delay. In this case it may be faster to make the blocks 3, 4, and 5 bits long instead of being equal in size. Now the delays into the final MUX are equal: 3 bit-delays plus 2 MUX delays for the carry signal from bits 0-6 and 5 bit-delays for the carry from bits 7-11. Adjusting the block size reduces the delay of large adders (more than 16 bits).
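The arithmetic behind the block-size choice can be worked through with a simple delay model. The model is an assumption of this sketch: the carry from block k ripples through the bits of that block and then passes one MUX for each later block before it reaches the final MUX, and a bit delay equals a MUX delay. The function name is hypothetical.

```python
def delay_into_final_mux(block_sizes, bit_delay=1, mux_delay=1):
    """Worst-case arrival time at the final MUX under the assumed model.

    Block 0 holds the LSBs. A carry from block k needs (size of block k)
    bit delays plus one MUX delay for every block after k except the last
    MUX itself, which is where the arrival time is measured.
    """
    last = len(block_sizes) - 1
    return max(size * bit_delay + (last - k) * mux_delay
               for k, size in enumerate(block_sizes))


print(delay_into_final_mux([4, 4, 4]))  # 6
print(delay_into_final_mux([3, 4, 5]))  # 5
```

With equal 4-bit blocks the carry from the LSB block arrives last (4 bit delays plus 2 MUX delays); with blocks of 3, 4, and 5 bits all three paths arrive after 5 delay units.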

We can extend the idea behind a carry-select adder as follows. Suppose we have an n-bit adder that generates two sums: One sum assumes a carry-in condition of '0', the other sum assumes a carry-in condition of '1'. We can split this n-bit adder into an i-bit adder for the i LSBs and an (n - i)-bit adder for the n - i MSBs. Both of the smaller adders generate two conditional sums as well as true and complement carry signals. The two (true and complement) carry signals from the LSB adder are used to select between the two (n - i + 1)-bit conditional sums from the MSB adder using 2(n - i + 1) two-input MUXes. This is a conditional-sum adder (also often abbreviated to CSA) [Sklansky, 1960]. We can recursively apply this technique. For example, we can split a 16-bit adder using i = 8 and n = 16; then we can split one or both 8-bit adders again, and so on.

Figure 2.25 shows the simplest form of an n-bit conditional-sum adder that uses n single-bit conditional adders, H (each with four outputs: two conditional sums, true carry, and complement carry), together with a tree of 2:1 MUXes (Qi_j). The conditional-sum adder is usually the fastest of all the adders we have discussed (it is the fastest when logic cell delay increases with the number of inputs; this is true for all ASICs except FPGAs).
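The recursive splitting can be sketched as a function that returns both conditional results for a group of bits. The name `cond_sum` and the choice to split each group in half are assumptions of this example, and only the true carry is modeled (the complement carry is omitted); it is a behavioral illustration rather than a description of the cells in Figure 2.25.

```python
def cond_sum(a, b, width):
    """Conditional-sum addition of two width-bit integers.

    Returns ((sum0, cout0), (sum1, cout1)): the sum and carry out assuming
    a carry in of '0', and the same assuming a carry in of '1'.
    """
    if width == 1:
        # 1-bit conditional adder: both cases come from A and B alone.
        return ((a ^ b, a & b), (1 - (a ^ b), a | b))
    lo_w = width // 2            # number of LSBs in the lower group
    hi_w = width - lo_w          # number of MSBs in the upper group
    lo_mask = (1 << lo_w) - 1
    lo = cond_sum(a & lo_mask, b & lo_mask, lo_w)
    hi = cond_sum(a >> lo_w, b >> lo_w, hi_w)
    results = []
    for lo_sum, lo_cout in lo:
        # The carry out of the LSB group selects one of the two
        # conditional results of the MSB group (the 2:1 MUXes).
        hi_sum, hi_cout = hi[lo_cout]
        results.append((lo_sum | (hi_sum << lo_w), hi_cout))
    return tuple(results)


# The actual carry in picks the final result: index 0 for '0', 1 for '1'.
print(cond_sum(0xFFFF, 0x0001, 16)[0])  # (0, 1)
```

No group ever waits for a carry from a lower group before starting its own additions; the carries are used only to steer MUXes, so the depth of the structure grows with the number of splitting levels rather than with n.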

 

FIGURE 2.25  The conditional-sum adder. (a) A 1-bit conditional adder that calculates the sum and carry out assuming the carry in is either '1' or '0'. (b) The multiplexer that selects between sums and carries. (c) A 4-bit conditional-sum adder with carry input, C[0].

