
We can view addition in terms of **generate**, G[i], and **propagate**, P[i], signals.

G[i] = A[i] · B[i]   or   G[i] = A[i] · B[i]   (2.42)

P[i] = A[i] ⊕ B[i]   or   P[i] = A[i] + B[i]   (2.43)

C[i] = G[i] + P[i] · C[i - 1]   or   C[i] = G[i] + P[i] · C[i - 1]   (2.44)

S[i] = P[i] ⊕ C[i - 1]   or   S[i] = A[i] ⊕ B[i] ⊕ C[i - 1]   (2.45)

where C[i] is the carry-out signal from stage i, equal to the carry in of stage (i + 1). Thus, C[i] = COUT[i] = CIN[i + 1]. We need to be careful because C[0] might represent either the carry in or the carry out of the LSB stage. For an adder we set the carry in to the first stage (stage zero), C[-1] or CIN[0], to '0'. Some people use **delete** (D) or **kill** (K) in various ways for the complements of G[i] and P[i], but unfortunately others use C for COUT and D for CIN, so I avoid using any of these. Do not confuse the two different methods (both of which are used) in Eqs. 2.42–2.45 when forming the sum, since the propagate signal, P[i], is different for each method.
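As a rough illustration of Eqs. 2.42–2.45, the following Python sketch ripples a carry through the stages using either definition of the propagate signal. The function names, the `method` argument, and the LSB-first bit lists are assumptions made for this example only; they are not part of any library or of the figures.

```python
def to_bits(value, n):
    """Return the n low-order bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Rebuild an integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, method=1, carry_in=0):
    """Add two equal-length LSB-first bit lists; return (sum_bits, carry_out)."""
    carry = carry_in                      # C[-1], set to '0' for a plain adder
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        g = a & b                         # Eq. 2.42 (same for both methods)
        if method == 1:
            p = a ^ b                     # Eq. 2.43, XOR form
            s = p ^ carry                 # Eq. 2.45, sum formed from P[i]
        else:
            p = a | b                     # Eq. 2.43, OR form
            s = a ^ b ^ carry             # Eq. 2.45, sum formed from A[i], B[i]
        carry = g | (p & carry)           # Eq. 2.44
        sum_bits.append(s)
    return sum_bits, carry


s, c = ripple_carry_add(to_bits(5, 4), to_bits(6, 4), method=2)
print(from_bits(s), c)                    # prints "11 0"
```

Both methods give the same sum and carry out. Mixing them (for example, forming S[i] = P[i] ⊕ C[i - 1] with the OR form of P[i]) does not.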

Figure 2.22(a) shows a conventional RCA. The delay of an n-bit RCA is proportional to n and is limited by the propagation of the carry signal through all of the stages. We can reduce delay by using pairs of "go-faster" bubbles to change AND and OR gates to fast two-input NAND gates as shown in Figure 2.22(a). Alternatively, we can write the equations for the carry signal in two different ways:

either C[i] = A[i] · B[i] + P[i] · C[i - 1]   (2.46)

or C[i] = (A[i] + B[i]) · (P[i]' + C[i - 1]),   (2.47)

where P[i]' = NOT(P[i]). Equations 2.46 and 2.47 allow us to build the carry chain from two-input NAND gates, one per cell, using different logic in even and odd stages (Figure 2.22b):

even stages: C1[i]' = P[i] · C3[i - 1] · C4[i - 1];   odd stages: C3[i]' = P[i] · C1[i - 1] · C2[i - 1]   (2.48)

even stages: C2[i] = A[i] + B[i];   odd stages: C4[i]' = A[i] · B[i]   (2.49)

even stages: C[i] = C1[i] · C2[i];   odd stages: C[i] = C3[i]' + C4[i]'   (2.50)

(the carry inputs to stage zero are C3[-1] = C4[-1] = '0'). We can use the RCA of Figure 2.22(b) in a datapath, with standard cells, or on a gate array.
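A quick way to convince ourselves that Eqs. 2.46 and 2.47 describe the same carry is to compare them over all eight input combinations. The sketch below assumes the XOR form of the propagate signal, P[i] = A[i] ⊕ B[i]; the function names are made up for this example.

```python
from itertools import product


def carry_eq_2_46(a, b, c_prev):
    """Carry as a sum of products, Eq. 2.46."""
    p = a ^ b                          # assumed XOR form of P[i]
    return (a & b) | (p & c_prev)


def carry_eq_2_47(a, b, c_prev):
    """Carry as a product of sums, Eq. 2.47."""
    p = a ^ b                          # assumed XOR form of P[i]
    not_p = 1 - p                      # P[i]' = NOT(P[i])
    return (a | b) & (not_p | c_prev)


# Collect every input combination for which the two forms disagree.
mismatches = [
    (a, b, c_prev)
    for a, b, c_prev in product((0, 1), repeat=3)
    if carry_eq_2_46(a, b, c_prev) != carry_eq_2_47(a, b, c_prev)
]
print(mismatches)                      # prints "[]"
```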

Instead of propagating the carries through each stage of an RCA, Figure 2.23 shows a different approach. A **carry-save adder** (**CSA**) cell CSA(A1[i], A2[i], A3[i], CIN, S1[i], S2[i], COUT) has three outputs:

S1[i] = CIN   (2.51)

S2[i] = A1[i] ⊕ A2[i] ⊕ A3[i] = PARITY(A1[i], A2[i], A3[i])   (2.52)

COUT = A1[i] · A2[i] + [(A1[i] + A2[i]) · A3[i]] = MAJ(A1[i], A2[i], A3[i])   (2.53)

The inputs, A1, A2, and A3, and the outputs, S1 and S2, are buses. The input, CIN, is the carry from stage (i - 1). The carry in, CIN, is connected directly to the output bus S1, as indicated by the schematic symbol (Figure 2.23a). We connect CIN[0] to VSS. The output, COUT, is the carry out to stage (i + 1).

A 4-bit CSA is shown in Figure 2.23(b). The arithmetic overflow signal for ones' complement or two's complement arithmetic, OV, is XOR(COUT[MSB], COUT[MSB - 1]) as shown in Figure 2.23(c). In a CSA the carries are "saved" at each stage and shifted left onto the bus S1. There is thus no carry propagation and the delay of a CSA is constant. At the output of a CSA we still need to add the S1 bus (all the saved carries) and the S2 bus (all the sums) to get an n-bit result using a final stage that is not shown in Figure 2.23(c). We might regard the n-bit sum as being encoded in the two buses, S1 and S2, in the form of the parity and majority functions.

We can use a CSA to add multiple inputs; as an example, an adder with four 4-bit inputs is shown in Figure 2.23(d). The last stage sums two input buses using a **carry-propagate adder** (**CPA**). We have used an RCA as the CPA in Figure 2.23(d) and (e), but we can use any type of adder. Notice in Figure 2.23(e) how the two CSA cells and the RCA cell abut together horizontally to form a **bit slice** (or slice) and then the slices are stacked vertically to form the datapath.
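The arrangement of two CSAs followed by a CPA can be sketched in a few lines of Python that operate on whole buses at once, one integer per bus. This is only a behavioral model under stated assumptions: the names `csa` and `add_four` are invented here, the saved-carry bus keeps the carry out of the MSB as an extra bit, and Python's `+` stands in for the final CPA.

```python
def csa(a1, a2, a3):
    """Carry-save add three buses; return (S1, S2) as integers.

    S2 holds the bitwise parity (Eq. 2.52). S1 holds the bitwise
    majority (Eq. 2.53) shifted left by one, so S1[0] is '0'.
    """
    s2 = a1 ^ a2 ^ a3
    cout = (a1 & a2) | ((a1 | a2) & a3)
    s1 = cout << 1
    return s1, s2


def add_four(a, b, c, d):
    """Add four inputs with two CSA stages and one carry-propagate add."""
    s1, s2 = csa(a, b, c)        # first CSA: three inputs become two buses
    s1, s2 = csa(s1, s2, d)      # second CSA folds in the fourth input
    return s1 + s2               # CPA: the only stage that propagates carries


print(add_four(15, 15, 15, 15))  # prints "60"
```

No carry ripples inside `csa`: every bit position is computed independently, which is why the delay of each CSA stage does not depend on the word length.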

We can register the CSA stages by adding vectors of flip-flops as shown in Figure 2.23(f). This reduces the adder delay to that of the slowest adder stage, usually the CPA. By using registers between stages of combinational logic we use **pipelining** to increase the speed, paying a price in increased area (for the registers) and in added **latency**. It takes a few clock cycles (the latency, equal to n clock cycles for an n-stage pipeline) to fill the pipeline, but once it is filled, the answers emerge every clock cycle. Ferris wheels work much the same way. When the fair opens it takes a while (latency) to fill the wheel, but once it is full the people can get on and off every few seconds. (We can also pipeline the RCA of Figure 2.20. We add i registers on the A and B inputs before ADD[i] and add (n - i) registers after the output S[i], with a single register before each C[i].)
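The fill-then-stream behavior of a pipeline can be seen in a small clock-by-clock model. The sketch below is not tied to any particular adder: `run_pipeline` and the three toy stage functions are hypothetical, and each stage is assumed to be followed by exactly one register.

```python
def run_pipeline(stages, inputs):
    """Clock inputs through stages; return (clock_cycle, result) pairs."""
    regs = [None] * len(stages)                   # one register after each stage
    results = []
    stream = list(inputs) + [None] * len(stages)  # extra cycles to drain the pipe
    for cycle, item in enumerate(stream, start=1):
        # Update the last register first so that each stage reads the value
        # its predecessor held before this clock edge.
        for k in reversed(range(len(stages))):
            src = item if k == 0 else regs[k - 1]
            regs[k] = None if src is None else stages[k](src)
        if regs[-1] is not None:
            results.append((cycle, regs[-1]))
    return results


toy_stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_pipeline(toy_stages, [1, 2, 3]))
# prints "[(3, 1), (4, 3), (5, 5)]"
```

With three stages the first answer appears at the end of clock cycle 3 (the latency), and one answer follows on every cycle after that.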

The problem with an RCA is that every stage has to wait to make its carry decision, C[i], until the previous stage has calculated C[i - 1]. If we examine the propagate signals we can bypass this critical path. Thus, for example, to bypass the carries for bits 4–7 (stages 5–8) of an adder we can compute BYPASS = P[4] · P[5] · P[6] · P[7] and then use a MUX as follows:

C[7] = (G[7] + P[7] · C[6]) · BYPASS' + C[3] · BYPASS.   (2.54)

Adders based on this principle are called **carry-bypass adders** (**CBA**) [Sato et al., 1992]. Large, custom adders employ **Manchester-carry chains** to compute the carries and the bypass operation using TGs or just pass transistors [Weste and Eshraghian, 1993, pp. 530–531]. These types of carry chains may be part of a predesigned ASIC adder cell, but are not used by ASIC designers.
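The bypass MUX of Eq. 2.54 can be checked against a plain rippled carry. The following sketch assumes an 8-bit adder, a carry in of '0', and the XOR form of the propagate signal (with the OR form a stage can generate and propagate at once, and the bypass would then be wrong). The function name and its return convention are invented for this example.

```python
def carry_7_with_bypass(a, b):
    """Return (C[7] from the bypass MUX of Eq. 2.54, C[7] from the ripple chain)."""
    a_bits = [(a >> i) & 1 for i in range(8)]
    b_bits = [(b >> i) & 1 for i in range(8)]
    g = [x & y for x, y in zip(a_bits, b_bits)]   # generate signals
    p = [x ^ y for x, y in zip(a_bits, b_bits)]   # propagate signals (XOR form)

    c = []                                        # rippled carries C[0]..C[7]
    prev = 0                                      # C[-1] = '0'
    for i in range(8):
        prev = g[i] | (p[i] & prev)
        c.append(prev)

    bypass = p[4] & p[5] & p[6] & p[7]
    rippled = g[7] | (p[7] & c[6])
    muxed = (rippled & (1 - bypass)) | (c[3] & bypass)
    return muxed, c[7]


print(carry_7_with_bypass(0b11111000, 0b00001000))   # prints "(1, 1)"
```

In the printed example bits 4–7 all propagate, so the MUX simply forwards C[3] instead of waiting for the carry to ripple through four stages.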

Instead of checking the propagate signals we can check the inputs. For example we can compute SKIP = (A[i - 1] ⊕ B[i - 1]) + (A[i] ⊕ B[i]) and then use a 2:1 MUX to select C[i]. Thus,

CSKIP[i] = (G[i] + P[i] · C[i - 1]) · SKIP' + C[i - 2] · SKIP.   (2.55)

This is a **carry-skip adder** [Keutzer, Malik, and Saldanha, 1991; Lehman, 1961]. Carry-bypass and carry-skip adders may include redundant logic (since the carry is computed in two different ways; we just take the first signal to arrive). We must be careful that the redundant logic is not optimized away during logic synthesis.

If we evaluate Eq. 2.44 recursively for i = 1, and recall that the carry in to stage zero, C[-1], is '0', we get the following:

C[1] = G[1] + P[1] · C[0] = G[1] + P[1] · (G[0] + P[0] · C[-1]) = G[1] + P[1] · G[0].   (2.56)

This result means that we can "look ahead" by two stages and calculate the carry into the third stage (bit 2), which is C[1], using only the first-stage inputs (to calculate G[0]) and the second-stage inputs. This is a **carry-lookahead adder** (**CLA**) [MacSorley, 1961]. If we continue expanding Eq. 2.44, we find:

C[2] = G[2] + P[2] · G[1] + P[2] · P[1] · G[0],

C[3] = G[3] + P[3] · G[2] + P[3] · P[2] · G[1] + P[3] · P[2] · P[1] · G[0].   (2.57)
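The pattern in these expansions is that C[i] collects one term per stage j ≤ i: the generate signal G[j] gated by every propagate signal above it. The sketch below builds the carries that way and compares them with the recursive form of Eq. 2.44. It assumes a carry in of '0' and LSB-first bit lists; the function names are made up for this example.

```python
def ripple_carries(a_bits, b_bits):
    """Carries from the recursive form, Eq. 2.44, with C[-1] = '0'."""
    carries, prev = [], 0
    for a, b in zip(a_bits, b_bits):
        prev = (a & b) | ((a ^ b) & prev)
        carries.append(prev)
    return carries


def lookahead_carries(a_bits, b_bits):
    """Carries from the fully expanded (lookahead) form of Eq. 2.44."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    carries = []
    for i in range(len(g)):
        c = 0
        for j in range(i + 1):
            term = g[j]                    # G[j] ...
            for k in range(j + 1, i + 1):
                term &= p[k]               # ... gated by P[j+1] through P[i]
            c |= term
        carries.append(c)
    return carries


a_bits, b_bits = [1, 1, 0, 1], [1, 0, 1, 0]    # 11 + 5, LSB first
print(lookahead_carries(a_bits, b_bits))       # prints "[1, 1, 1, 1]"
```

Each lookahead carry depends only on the adder inputs, not on another carry, but the number of terms (and the widest AND) grows with i.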

As we look ahead further these equations become more complex, take longer to calculate, and the logic becomes less regular when implemented using cells with a limited number of inputs. Datapath layout must fit in a bit slice, so the physical and logical structure of each bit must be similar. In a standard cell or gate array we are not so concerned about a regular physical structure, but a regular logical structure simplifies design. The Brent-Kung adder reduces the delay and increases the regularity of the carry-lookahead scheme [Brent and Kung, 1982]. Figure 2.24(a) shows a regular 4-bit CLA, using the carry-lookahead generator cell (CLG) shown in Figure 2.24(b).

In a **carry-select adder** we duplicate two small adders (usually 4-bit or 8-bit adders, often CLAs) for the cases CIN = '0' and CIN = '1' and then use a MUX to select the case that we need: wasteful, but fast [Bedrij, 1962]. A carry-select adder is often used as the fast adder in a datapath library because its layout is regular.
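A behavioral sketch of the carry-select idea follows. Each block forms both candidate results and the incoming carry only steers a MUX. The block adders are modeled with Python's `+` rather than with CLAs, and the function name and the `width` and `block` parameters are assumptions for this example (`width` must be a multiple of `block`).

```python
def carry_select_add(a, b, width=16, block=4):
    """Add two width-bit integers block by block; return (sum, carry_out)."""
    mask = (1 << block) - 1
    result = 0
    carry = 0                                # carry in to the first block
    for shift in range(0, width, block):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum_if_0 = x + y                     # small adder assuming CIN = '0'
        sum_if_1 = x + y + 1                 # duplicate adder assuming CIN = '1'
        chosen = sum_if_1 if carry else sum_if_0    # the MUX
        result |= (chosen & mask) << shift
        carry = chosen >> block              # selected carry out of this block
    return result, carry


print(carry_select_add(0xFFFF, 0x0001))      # prints "(0, 1)"
```

In hardware both candidate sums of every block are ready at about the same time, so the critical path is one block adder plus the chain of MUX selects.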

We can use the carry-select, carry-bypass, and carry-skip architectures to split a 12-bit adder, for example, into three blocks. The delay of the adder is then partly dependent on the delays of the MUX between each block. Suppose the delay due to 1 bit in an adder block (we shall call this a bit delay) is approximately equal to the MUX delay. In this case it may be faster to make the blocks 3, 4, and 5 bits long instead of being equal in size. Now the delays into the final MUX are equal: 3 bit delays plus 2 MUX delays for the carry signal from bits 0–6 and 5 bit delays for the carry from bits 7–11. Adjusting the block size reduces the delay of large adders (more than 16 bits).
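The arithmetic behind the 3, 4, 5 split can be reproduced with a very simple timing model. The model is an assumption made for illustration only: all blocks start computing at time zero, a block of k bits takes k bit delays, every block except the last passes its carry through one MUX, and wiring delay is ignored.

```python
def delay_into_final_mux(block_sizes, bit_delay=1, mux_delay=1):
    """Latest arrival time at the final MUX, in the simplified model above."""
    carry_arrival = 0
    for size in block_sizes[:-1]:
        # A block's MUX can switch only when both its own result and the
        # incoming carry are ready; the MUX then adds its own delay.
        carry_arrival = max(carry_arrival, size * bit_delay) + mux_delay
    last_block_ready = block_sizes[-1] * bit_delay
    return max(carry_arrival, last_block_ready)


print(delay_into_final_mux([3, 4, 5]))   # prints "5"
print(delay_into_final_mux([4, 4, 4]))   # prints "6"
```

With graded block sizes the select signal and the final block's data arrive together after 5 delays. With equal blocks the final block is ready after 4 delays but has to wait until 6 for its select signal.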

We can extend the idea behind a carry-select adder as follows. Suppose we have an n-bit adder that generates two sums: one sum assumes a carry-in condition of '0', the other sum assumes a carry-in condition of '1'. We can split this n-bit adder into an i-bit adder for the i LSBs and an (n - i)-bit adder for the n - i MSBs. Both of the smaller adders generate two conditional sums as well as true and complement carry signals. The two (true and complement) carry signals from the LSB adder are used to select between the two (n - i + 1)-bit conditional sums from the MSB adder using 2(n - i + 1) two-input MUXes. This is a **conditional-sum adder** (also often abbreviated to CSA) [Sklansky, 1960]. We can recursively apply this technique. For example, we can split a 16-bit adder using i = 8 and n = 16; then we can split one or both 8-bit adders again, and so on.

Figure 2.25 shows the simplest form of an n-bit conditional-sum adder that uses n single-bit conditional adders, H (each with four outputs: two conditional sums, true carry, and complement carry), together with a tree of 2:1 MUXes (Qi_j). The conditional-sum adder is usually the fastest of all the adders we have discussed (it is the fastest when logic cell delay increases with the number of inputs; this is true for all ASICs except FPGAs).
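The recursive splitting can be sketched as follows. This is a behavioral model rather than a description of Figure 2.25: the function name is invented, the split point is fixed at half the width, and each adder returns the carry for the CIN = '0' case and the carry for the CIN = '1' case instead of true and complement carry rails.

```python
def conditional_sum(a_bits, b_bits):
    """Return (sum0, carry0, sum1, carry1) for LSB-first bit lists.

    sum0 and carry0 assume a carry in of '0'; sum1 and carry1 assume '1'.
    """
    n = len(a_bits)
    if n == 1:                               # single-bit conditional adder
        a, b = a_bits[0], b_bits[0]
        return [a ^ b], a & b, [1 - (a ^ b)], a | b

    i = n // 2                               # i LSBs and (n - i) MSBs
    lo = conditional_sum(a_bits[:i], b_bits[:i])
    hi = conditional_sum(a_bits[i:], b_bits[i:])

    def select(lo_sum, lo_carry):
        # The LSB adder's carry drives the MUXes that pick one of the two
        # conditional results (sum bits plus carry) from the MSB adder.
        hi_sum, hi_carry = (hi[2], hi[3]) if lo_carry else (hi[0], hi[1])
        return lo_sum + hi_sum, hi_carry

    sum0, carry0 = select(lo[0], lo[1])
    sum1, carry1 = select(lo[2], lo[3])
    return sum0, carry0, sum1, carry1


a_bits, b_bits = [1, 0, 1, 0], [1, 1, 0, 0]  # 5 + 3, LSB first
print(conditional_sum(a_bits, b_bits))
# prints "([0, 0, 0, 1], 0, [1, 0, 0, 1], 0)"
```

All the single-bit adders work in parallel and only the MUX tree lies on the critical path, so the depth grows with the logarithm of the word length in this model.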