NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or other reproductions of copyrighted material. Any copying of this document without permission of its author may be prohibited by law.

# Extending The Error Correction Capability Of Linear Codes

Ashok D. Ingle Daniel P. Siewiorek

Carnegie-Mellon University Pittsburgh, Pa.

March, 1973

This work was supported by the Advanced Research Projects Agency of the Office of the Secretary of Defense (F44620-70-C-O1O7) and is monitored by the Air Force Office of Scientific Research.

This document has been approved for public release and sale; its

distribution is unlimited.

# Extending The Error Correction Capability Of Linear Codes

# **ABSTRACT**

Linear transmission codes developed under were the assumption that the bit errors are independent and random. these codes are applied to digital circuits, this assumption no longer holds. Using the past history of the bit failures, a linear transmission code that can detect k bit failures can made to tolerate and correct upto (k-1) bit failures. classical error correction bounds are assumed , a linear the transmission code used in digital circuitry is under-utilized. single-error-correction, double-error-detection For example, the Hamming code could be used to correct up to two bit failures with some additional error correction circuitry. A simple algorithm for correcting these extra errors in linear codes is presented.

# **INTRODUCTION**

Several studies of linear transmission codes have been made over the last twenty years. These codes are designed for error detection ( to detect d errors the minimum code weight must be d+1 ), for error correction ( minimum weight 2t+1 to correct t errors ), and for error detection/correction { minimum weight t+d+1 to correct t errors and detect d errors ). [PeteW72], Most transmission codes are based assumption on the that transmission, each symbol is affected independently by noise and therefore the probability of a given error pattern depends only on the number of errors.

Transmission codes, such as the Hamming code, have been applied to digital circuitry such as memories. In such a case, specific bits of the word are directly associated with a given memory bit, driver, bus line, etc. Once a bit has taken an incorrect value, there is a much higher probability that the particular bit will be in error again due to a permanent or transient failure in its associated circuitry. An error correction algorithm can use the past history of bit failures to increase the error correction capability beyond the traditionally accepted limits for transmission codes.

#### LINEAR CODES

A bit failure wiil be considered to be any permanent or transient failure circuitry associated with in the logic that bit. Under this assumption, it will be shown that a minimum weight t code can detect as well as correct t-2 failures. in the case of noise ( which can affect the various bits of information independently ) will the classical bounds of t-1 error detection, |(t-1)/2| error correction be applicable. Thus transmission codes. when applied to digital circuits, are under-utilized when the classical bounds are assumed.

Classically, a linear code is a code space V generated by a generator matrix G. It is also the null space of the parity-check matrix H, i.e.

VH' - 0

or, V v c V,  $vH^{T} - 0$ 

For the treatment that follows assume that the minimum weight of the code is t [\*] A single error may be specified by a vector

# \* Definitions [PeteW72]

Hamming weight of a vector in a linear code is the number of nonzero symbols in the vector.

The minimum weight of a linear code is the minimum of the Hamming weights of all the nonzero vectors.

which has a single bit set to one. If we denote a vector whose leading bit is "1" and all other bits are "0" by e, then the set of alt possible single errors is the permutation ring generated by £. Mathematically, for any e,  $\in$  E, the weight of e, is 1 and for all i H j, e, \* e, . In other words E contains all distinct vectors of weight 1. A vector with a single error is thus characterised by v - v + e, . Therefore,

e,H' is called the syndrome, denoted by s and is nonzero if t > 1 .

Now, assume that there are m errors, where m<t . Then :

$$v - v + JC e$$
, for some  $v i V$ 

Therefore,

Since W(e.) - 1 , V I, and since all e.'s are distinct, W(g e.) - m . And because m<t, £ e.) / V. Hence, s t 0. This conclusively shows that we detect as many as <t-l> errors.

#### **ERROR CORRECTION**

In this section, we intend to show that by using the past history of bits in failure, one can extend the error correcting capability far beyond that predicted by the classical bounds. must modify the existing fault-model. We will now assume that the bit failures occur independently and one at a time. But since we no longer restrict our attention to arbitrary failure patterns in hitherto unfailed hardware, failure history We will assume that all failures are of the stuck-at importance. type ( This assumption can easily be waived, as we will show Therefore, once a bit has failed, it remains failed. later ). failed bit may yet be correct if the data happens to match with the stuck-at value of the bit. A bit will be said to be in error if it takes on the wrong value. Thus, at any particular point in time, we will presume the knowledge of the bits in failure and assume that the bit failure pattern may cover at most one bit outside these known failed bits. We now proceed to prove our claims starting with an example.

Consider the Hamming (8,4) code. The minimum weight, t-4. Under the assumption that no two ( new ) bit failures occur simultaneously, the first single bit failure ( characterized by, say, e, ) is uniquely specified by the syndrome, s, - e,H\ At the next check point, as the syndrome s is formed, three cases

# are possible:

- i) s indicates no error ( i.e. s-0 >. The failed bit(s) agree with their expected values.
- ii) s s, . There is a single error, characterized by e,, namely, the same as before. It can be corrected.
- iii) s / 0, s it s... Let s, s. Let e, be the vector that chacterizes the error indicated by s,.
  - a) Treat it as a single bit failure indicated by s,. Correct it and re-form s. if s 0 then it was a single bit failure, and is now corrected, otherwise it was a double error.
  - b) For double error, one of the bit failures must be at s,. Correct it, re-form s, and correct the bit indicated by s now.

Successful correction of upto 2 ( - t-2 ) errors is thus achieved [\*i

By generalization of this procedure, correction of upto (t-2) bit failures for codes with the minimum weight t>4 is

<sup>\*</sup> Note that it is possible to simplify this algorithm, by using the peculiarity of the Hamming (8,4) code, namely, that one bit of the syndrome merely is overall parity, which enables one to distinguish between even and odd errors. We chose to ignore this, however, in order to maintain the generality of the discussion.

Consider the following algorithm, which is a direct generalization of the one described above. We will now change i bits from the Known bit failures at a time, and form the syndrome. This syndrome will indicate a particular bit. "correct" that bit, and re-form the syndrome. If the syndrome is zero then the correction is over, and we will update the known bit failure pattern. Otherwise the "correction" was erroneous, we restore the bit value and proceed to check the next set of i If we have exhausted all the patterns of i bits, we will increment i, and start again. The process continues until we either correct the errors or exhaust all the known bit failures. We may restate the algorithm as follows:

Assume that m bit\* were known to have failed, where mS(t-2).

- a) Set i = 0.
- b) Change i of the known failed bitt at a time and form the syndrome «. If m (t-2) then go to step d.
- c) Change the bit indicated by t and re-form s.
- d) If s=0 then done, update the knowledge of /oiled bitt, if necessary, and EXIT.
- e) If all sets of i bits have been tried then go to step f else go to step b.
- f) Set i -".

g) If i < (t-2) then go to step 6, otherwise the fault exceeds our correction capability.

The algorithm terminates in a finite number of steps under our assumptions. However, as a safety measure, one may check for i exceeding (m+1) at step f, which may detect particular instances of multiple errors beyond the known failures. To prove that the algorithm can correct (t-2) failures:

Let m number of bits known to have failed; <math>m < (t-2).

Let k « number of bits in error in the current word; by

virtue of the assumed failure model k < (m+1).

Since the algorithm is exhaustive, it will certainly attain the combination of bits in error at some point. Therefore, one only needs to prove that it does not yield a zero syndrome for any combination of bits other than those in error.

We will consider the two cases m £ (t-3) and m - (t-2), First let m < (t-3). Consider any trial combination separately. of i bits. Let the number of bits that are changed erringly ( as the i bits are changed ) be w. Thus the number of correct changes is (i-w), and there are (i-w) bits common between k and Hence the total number of errors in the word, after the syndrome is formed and "correction" made. equals ( k+w+1 ) - ( i-w ) or k+2w+1-i. To ensure that no erroneous "correction" terminates the algorithm, the following inequality must hold.

$$< k + 2w + l - i > (t - l) \dots \{ l \}$$

By the definition of w, w < i.

Therefore, 2w - i < w.

Thus, (I) holds if

$$k + w + 1 < (t - 1)$$

Again, by definition, k and w occupy at most ( m+1 ) bits and they span disjoint sets of bits. ( See Figure I).

Hence, 
$$k + w < (m + 1)$$

Therefore, ( I) holds if

But this is true by assumption. Therefore, { I ) holds.

Now, let us take the case of m - (t-2). Since we do not attempt "correction", we need not consider the addition of an error. In other words, we know that we have reached the bound on correction of additional errors and restrict ourselves to correcting errors

Figure I: Conceptual mapping of bits in failure.



among the known ( t-2 ) failed bits. Now, k £ m - ( t-2 ). Hence, the inequality ( I ) may be rewritten as :

$$(k + 2 w - i) £ (t - 1) ... < || )$$

Again, 2w - i < w.

or, 
$$k + 2w - i \pounds k + w$$
.

Therefore, (II) holds if

$$k + w \pounds < t - 1$$
)

But now k and w span disjoit bits among m ( - t-2 ) bits.

Hence, 
$$k + w s \{t - 2\}$$

Therefore, { li ) holds.

Thus it is conclusively proved that the algorithm terminates in and only in the required correction for m S (|-3>, and hence we can correct upto (t-2) errors.

### **DISCUSSION**

The algorithm, if implemented in its present form, could be extremely time consuming. In the worst case ( k - m+1 - t-2 ), one has to go through

$$\sum_{i=0}^{t-2} {t-2 \choose i}$$

different trials before arriving at the proper correction.

Some improvement may be achieved by use of table look-up. Figure 2 may exemplify such methods. The syndrome and the known bit failures may be used to index into a table and retrieve the

required corrections. Depending on the versatility of the table, which may be function of the storage space available, the indexing may involve from a single reference to an organized search of the table.

The algorithm presented here is very simple-minded, and serves only as a tool to prove our claims. In practice, it should be improved upon. Clever codes may be designed in order to simplify and minimize the overheads involved ( In Hamming (8,4) code, for example, it is possible to reduce a step, as the overall parity bit can indicate even or odd error ).

The assumption of stuck-at faults may be relaxed, if along with the bit failures, one also stores the value the bit is stuck at. Every fault, then, can be treated as a stuck-at fault, and stored as such. If any bit differs from its stuck-at value, the fault was a transient one, and hence can be deleted from the store.

The improved correction capability may be employed in



Figure 2 : One implementation of table look-up scheme.

several ways. Consider а Hamming single-error-correction, double-error-correction code applied to a bit-sliced When an error is detected, a spare bit-slice may be switched in. This new bit-slice will contain erroneous information until every bit on the slice ( i.e. every word in the memory block ) is written into at least once. Since this "update" time may be substantial - and, in the worst case, infinite - the correction capability of the system may be seriously affected, even though additional spare bit-slices may be available. The second failure causes the memory to By extending the halt. capability, as outlined above, a bit failure can be corrected, even in the presence of a newly switched in bit-slice that may continue to yield erroneous information. **Extended** correction capability may also provide the basis for toleration of greater number of bit errors and may be a feasible alternative to dynamic switching of bit-slices.

Since the theory employed here is that of the classical linear codes, all the extentions of the theory, such as the ones to byte ( or symbol ) correction, may be similarly derived from the work presented here.

The effects of our new assumption should be studied in connection with other theoretical work, especially other codes { e.g. arithmatic codes ), and this is the direction of our

future work.

# **REFERENCES**

Pet«\*W72 Peterson, W. W., Weldon, E. J. Jr, Error-Correcting

Codes, II Ed., The MIT Press, Cambridge,

Massachusetts, 1972.