National Conference of Standards Laboratories

Workshop & Symposium

Man views phenomena as random when he does not understand the underlying mechanisms. The emphasis on statistical tools for uncertainty estimation, combined with a lack of knowledge of the underlying physics, drives our focus towards the apparently random properties of errors.

The paper demonstrates the arbitrary nature of the distinction between systematic and
random error. It proposes that there is no reason to believe that any error is random.
Finally it concludes that a thorough analysis of the mechanisms that govern variations in
measurements integrated into the GUM^{1} method can yield not only an estimate of
the uncertainty, but can also help improve it.

Traditionally we have divided errors into systematic and random components. Anything we could explain, such as a temperature influence, as well as errors that followed a certain pattern and *looked* systematic, was characterized as systematic error. Anything else was considered random error.

This allowed us to use statistical tools to predict certain aspects of the behavior of the random component of the error. We could find the standard deviation to describe the magnitude of the error and we could perform F-tests or t-tests to convince ourselves that the error was indeed random.

The fact we ignored, but which was there all along, was that the harder we looked at a measuring process and the more resources we put into understanding it, the more errors started appearing systematic to us.

In this paper we will look at the errors found in one measuring process and show how they can be interpreted using different tools. We will see that the only logical explanation is that all errors are systematic; they only *appear* random when we have limited information or when our sampling is not dense enough.

The measuring process we are considering is that of measuring a two-point size. Table 1 gives the value of the observed deviation from nominal size in microns for 60 individual measurements.

| Observation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 10 | 9 | 9 | 10 | 11 | 12 | 12 | 11 |
| 10 | 10 | 10 | 11 | 12 | 12 | 12 | 11 | 10 | 11 | 11 |
| 20 | 12 | 12 | 11 | 10 | 10 | 10 | 11 | 11 | 11 | 10 |
| 30 | 9 | 9 | 9 | 10 | 10 | 10 | 8 | 8 | 8 | 8 |
| 40 | 9 | 9 | 8 | 7 | 7 | 7 | 8 | 9 | 9 | 8 |
| 50 | 7 | 7 | 8 | 9 | 9 | 9 | 8 | 7 | 8 | 8 |

**Table 1:** Observed deviation from nominal size in micrometers.
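As a cross-check, the summary statistics relied on in the following sections can be recomputed directly from Table 1; a minimal Python sketch:

```python
import statistics

# The 60 observed deviations from Table 1, read row by row (µm)
observations = [
    10, 10, 10, 9, 9, 10, 11, 12, 12, 11,
    10, 10, 11, 12, 12, 12, 11, 10, 11, 11,
    12, 12, 11, 10, 10, 10, 11, 11, 11, 10,
    9, 9, 9, 10, 10, 10, 8, 8, 8, 8,
    9, 9, 8, 7, 7, 7, 8, 9, 9, 8,
    7, 7, 8, 9, 9, 9, 8, 7, 8, 8,
]

mean = statistics.mean(observations)    # ≈ 9.5 µm, the bias quoted later
stdev = statistics.stdev(observations)  # sample standard deviation, ≈ 1.5 µm
```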

There are different techniques that can be used to find the standard deviation of the sample. The traditional Gage Repeatability and Reproducibility (GR&R) study from the Measurement Systems Analysis Reference Manual^{2}, for example, uses the ranges of each subset of the observations to derive it. Had the 60 observations represented 2 measurements of each of 10 parts by 3 different observers, the 6σ "repeatability" or "instrument error" would have been assessed to be 7.8 µm and the "reproducibility" or "observer error" would have been assessed to be 8.1 µm, yielding a total GR&R of 11.3 µm. This is the value the automotive industry uses as a measure of how capable a measuring process is.
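The range-based derivation of the components is not reproduced here, but the two quoted components combine by root-sum-of-squares into the quoted total, as a quick sketch shows (values from the text; the small difference to 11.3 µm is rounding in the range-method constants):

```python
import math

repeatability = 7.8    # "instrument error" from the range method, µm
reproducibility = 8.1  # "observer error", µm

# The two components combine by root-sum-of-squares
grr = math.sqrt(repeatability**2 + reproducibility**2)  # ≈ 11.2 µm
```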

Note that the true value never enters into this analysis; we are purely using the instrument's ability to yield consistent values as a measure of goodness.

If our analysis is a little more sophisticated and we use calibrated parts for our experiment, then we can look at the measurement error by subtracting the calibrated size X from the observed value Y to get the error E:

E = Y - X

If the values we saw in Table 1 represent the measurement error - the deviation from the calibrated value - then we find the average value of the error to be 9.5 µm, which is what we traditionally would have called the systematic error.

Based on this analysis we now have a systematic error (9.5 µm) and a random error (11.3 µm). We would like an overall measure for how wrong our measurement can be. One technique for doing this is to use the formula:

W = B + 3σ

where W is "how wrong we can be" (a measure conceptually equivalent to uncertainty), B is the systematic error or bias and 3σ is one half of the 6σ GR&R value.

This gives us W = 15.2 µm as a measure for how wrong we can be.
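As a sketch of the calculation (values from the text):

```python
bias = 9.5   # systematic error B, µm
grr = 11.3   # 6-sigma GR&R value, µm

w = bias + grr / 2.0  # B plus 3 sigma (half the 6-sigma GR&R)
# w ≈ 15.2 µm
```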

If we look at the same measuring process over a little longer timescale to evaluate the drift or stability of the process, we may use a series of observations over a 24-hour period. These observations may look as follows:

**Figure 1:** Observed values over a 24-hour period.

To analyze this data, we use engineering judgement to interpret what the data means. A typical way of doing this is to draw some arbitrary smooth curve through the data and decide that this line represents the systematic error. The deviation between each data point and the curve then becomes the random error.

The curve in figure 1 is such a curve. It is fitted by making the judgement that there is something inexplicable wrong with the 3rd and 6th data points and then fitting a sine curve through the rest of the data points.

We can then take the difference between the curve and each data point to find what we consider the random error in this model. This is shown in figure 2. If we again disregard the 3rd and 6th data points, then we find a standard deviation of 1.24 µm for the rest of the population. Focusing on the 6σ value, we find it to be 7.44 µm.

So all in all we have a systematic error that varies roughly between +11 µm and -9 µm, a random error of 7.44 µm (6σ) and two inexplicable events that do not fit the model. We usually refer to these as fliers or outliers. Adding 3σ of the random error (one side of the distribution) to the worst case systematic error to give us a 99.87 % worst case uncertainty (disregarding the fliers and assuming that the rest of the observations represent a normal distribution), we get an uncertainty value of 14.72 µm.
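The arithmetic behind the 14.72 µm figure, as a sketch (values from the text):

```python
worst_systematic = 11.0  # µm, the positive extreme of the fitted sine
residual_sigma = 1.24    # µm, std dev of residuals, outliers excluded

u = worst_systematic + 3.0 * residual_sigma  # one-sided 3-sigma worst case
# u = 14.72 µm
```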

**Figure 2:** The random error, interpreted as the
difference between each individual observation and the fitted line representing the
systematic error.

What is important to notice is that although we have put a lot of numbers around our model and done a lot of analysis, we still do not even begin to understand the cause of the variation we are seeing. We do not know if the variation we are seeing is representative of what the measuring system will do over time, nor do we know if it is liable to change and, if so, what will make it change. What we have come up with is a *pseudo explanation* of our measurement system.

The GUM approach^{1,3} is more analytical than either of the previous
approaches. It starts with an analysis of the variations in the influence factors that are
the *root causes of the variation* in the measuring system and propagates those
variations through the laws of physics into variations that can be observed in the values
measured by the measuring system.

Assume we are measuring the diameter of parts coming off a production machine. The parts come off the machine at a more or less constant temperature, since they are immersed in cutting fluid. The production machine runs 24 hours per day. We measure the parts on a gage, which is sitting off to the side of the machine by a west facing window. If we want to use the GUM approach to estimate the uncertainty of that measuring process, we have to start by identifying the factors which may cause variation in the results of the measurements.

We identify these factors as the following:

- Gage Temperature Variation
- Workpiece Temperature Variation
- Operator influence
- Digitization

The temperature is the limiting factor in most dimensional measurements. In this case we expect to see the highest temperature in the gage during the afternoon/early evening, as the sun is on the gage and the overall temperature in the shop is rising. When the temperature of the gage is high and the temperature of the workpieces is constant, the gage will see the workpieces as being smaller than they really are.

If the temperature of the cutting fluid that determines the temperature of the workpieces is equal to the average temperature of the gage, then we can model the influence caused by the variation in gage temperature over a 24-hour period as a sinewave.

Figure 3 shows that difference "translated" into microns using the laws of thermal expansion. The amplitude of the sinewave is 10 µm. This corresponds to a temperature variation of about ±5 °C if the part diameter is 200 mm.

**Figure 3:** Gage temperature influence.

The part temperature is to a large extent governed by the temperature of the cutting fluid. In this particular situation, the fluid comes from a central reservoir in the plant and is shared by a number of machines, not all of which run 24 hours per day. Since the volume of fluid is large, the temperature of the fluid changes only slowly and a sinewave is once again a good model for the variation.

Figure 4 shows the variation in cutting fluid temperature over a 24-hour period "translated" into microns using the laws of thermal expansion. The amplitude of the sinewave is 2 µm. This corresponds to a temperature variation of about ±1 °C if the part diameter is 200 mm.
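The amplitudes in figures 3 and 4 follow from the linear law of thermal expansion. The sketch below assumes a round expansion coefficient of 10·10⁻⁶/°C (steel is closer to 11.5·10⁻⁶/°C), which reproduces the paper's round numbers:

```python
ALPHA = 10e-6          # assumed expansion coefficient, 1/°C (steel: ~11.5e-6)
DIAMETER_UM = 200_000  # 200 mm part diameter expressed in µm

def thermal_influence_um(delta_t_c):
    """Size change caused by a temperature offset, in µm."""
    return ALPHA * DIAMETER_UM * delta_t_c

gage_amp = thermal_influence_um(5.0)  # ±5 °C -> 10 µm amplitude (Figure 3)
part_amp = thermal_influence_um(1.0)  # ±1 °C -> 2 µm amplitude (Figure 4)
```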

**Figure 4:** Part temperature influence, the part
temperature is governed by the temperature of the cutting fluid.

The operator influence is in many cases the hardest one to quantify. It is also the one it is hardest to be honest with ourselves about. If we are studying a measuring system and we encounter a bad reading, we want to disregard it, rationalizing that it will not happen to a well-trained operator in the real measuring situation.

This is of course a self-deception. Misreadings of gages happen at least as often in production measurements as they do during gage studies; we just do not know about them.

If we try to model the operator influence, including the little variations he causes by the way he puts the parts in the gage and by an occasional misreading of the gage, the resulting influence may be as shown in figure 5. The variations are modeled as a sinewave. This should not be taken to imply that this is a typical shape for this kind of variation; on the contrary, it is rarely a function that can be described by a simple equation. For purposes of illustrating the magnitude of this influence relative to the other ones, however, the simple sinewave model is used.

An off-set is included in the influence, modeling a slight difference in the way the operator uses the gage, from the way the laboratory technician, who sets up and calibrates the gage, uses it.

Finally one bad reading is included in the operator influence.

Figure 5 shows the operator influence over a 24-hour period. The influence is shown continuous, indicating: "If the operator was measuring at this time, this would be his influence." Although the operator is not measuring continuously, it is easy to envision that the operator off-set will have a finite value at any given time during the 24-hour period.

**Figure 5:** Operator influence: the combined effect of a ±1 µm variation, an off-set of 0.1 µm and a scale misreading of 8 µm.

The effect of limited resolution, be it in the form of a digital display or the ability of the operator to resolve the scale, is interrelated with the operator influence as well as the other effects. In this example a resolution of 1 µm is used. The influence of the resolution is calculated by adding up all the other influences and rounding it off to the closest micron.

As with the operator influence, the effect of the limited resolution only comes into play when a measurement takes place, but it can be envisioned that the rounding error would have a value at any given time, if a measurement took place at that time.

Figure 6 shows the value of the digitization/resolution influence over time.

**Figure 6:** Rounding error. Since the resolution is 1 µm, the rounding error varies between ±0.5 µm.

We get the total error by superimposing all the influences on one another. The result of this is shown in figure 7. In the normal situation, we do not know what the error is. Otherwise it would be easy to correct for it. Instead we use uncertainty statements to characterize the nature and the magnitude of the error.

**Figure 7:** The total error consists of all the
individual error components superimposed on each other.
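The superposition can be sketched in code. The phases are assumptions made purely for illustration, and the single operator misreading is omitted; the text specifies only the amplitudes, the 0.1 µm offset and the 1 µm resolution:

```python
import math

def gage_temp(t_hours):  # Figure 3: 10 µm amplitude, 24 h period (phase assumed)
    return 10.0 * math.sin(2.0 * math.pi * t_hours / 24.0)

def part_temp(t_hours):  # Figure 4: 2 µm amplitude (phase assumed)
    return 2.0 * math.sin(2.0 * math.pi * t_hours / 24.0 + 1.0)

def operator(t_hours):   # Figure 5: ±1 µm variation plus 0.1 µm offset
    return 1.0 * math.sin(2.0 * math.pi * t_hours / 24.0 + 2.0) + 0.1

def total_error(t_hours):
    exact = gage_temp(t_hours) + part_temp(t_hours) + operator(t_hours)
    return round(exact)  # 1 µm resolution: the rounding step of Figure 6

# Sample the model every 15 minutes over 24 hours
errors = [total_error(t / 4.0) for t in range(24 * 4)]
```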

If we apply the GUM uncertainty estimation method in this case, using the format recommended in ISO/TR 14253-2^{3}, we can sum up our analysis in table 2.

| Contributor | Evaluation Type | Distribution Type | Number of Measurements | Variation Limit, a [µm] | Variation Limit, a [Influence Unit] | Correlation Coefficient | Distribution Factor | Uncertainty Component |
|---|---|---|---|---|---|---|---|---|
| Gage Temperature | B | U-shaped | | 10 µm | 5 °C | 0 | 0.7 | 7 µm |
| Part Temperature | B | U-shaped | | 2 µm | 1 °C | 0 | 0.7 | 1.4 µm |
| Operator Influence | B | U-shaped | | 1.1 µm | 1.1 µm | 0 | 0.7 | 0.77 µm |
| Digitization/Rounding | B | Step | | 1 µm | 1 µm | 0 | 0.3 | 0.3 µm |
| Combined Uncertainty (square root of the sum of the squares of the uncertainty components) |||||||| 7.2 µm |
| Expanded Uncertainty (the Combined Uncertainty multiplied by k=2) |||||||| 14.4 µm |

**Table 2:** GUM uncertainty budget summarizing the
analysis of the measuring process.

The GUM analysis results in an Expanded Uncertainty of 14.4 µm. It disregards the outlier and takes the slightly conservative approach, modeling the operator influence as a 1.1 µm variation rather than a 0.1 µm offset and a 1 µm variation.
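The budget arithmetic of Table 2 can be sketched as follows (each uncertainty component is the variation limit times the distribution factor; values from the table):

```python
import math

# (variation limit a, distribution factor b) for each contributor, from Table 2
budget = {
    "gage temperature":      (10.0, 0.7),  # U-shaped, limit in µm
    "part temperature":      (2.0,  0.7),  # U-shaped
    "operator influence":    (1.1,  0.7),  # U-shaped
    "digitization/rounding": (1.0,  0.3),  # step (resolution)
}

# uncertainty component = variation limit * distribution factor
components = {name: a * b for name, (a, b) in budget.items()}

u_c = math.sqrt(sum(u * u for u in components.values()))  # combined, ≈ 7.2 µm
U = 2.0 * u_c                                             # expanded (k=2), ≈ 14.4 µm
```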

Having evaluated the same measuring situation several different ways, we can now compare the different approaches.

First it is important to understand that we have been using the same data in all the estimations.

Figure 8 shows how the data for the repeatability study and the stability evaluation were taken from the total error function that was generated based on the influences discussed in the GUM analysis.

**Figure 8:** The data for the GR&R study and the
stability study, taken from the Total Error for the measuring process.

When we see this, we can make several observations that illustrate how inadequate both the GR&R study and the 24-hour stability study are in terms of their ability to properly analyze a measuring process.

The results of the studies are given in table 3.

| Error/Uncertainty | Random | Systematic | Total |
|---|---|---|---|
| GR&R | 11.3 µm (Instrument: 7.8 µm, Operator: 8.1 µm) | | 11.3 µm |
| GR&R + Bias | 11.3 µm | 9.5 µm | 15.2 µm |
| Stability | 7.4 µm | 11 µm | 14.7 µm |
| GUM | 14.4 µm (random and systematic not separated) | | 14.4 µm |

**Table 3:** Components of Error/Uncertainty as evaluated by the different
methods. GR&R, GR&R+Bias and the Stability study all evaluate errors, whereas the
GUM method evaluates uncertainty.

The range of values in the underlying data set is -12 µm to +12 µm, except for the outlier, which is 18 µm. There are no values outside this range. This underlines the fact that none of the underlying effects follow a normal distribution: if they did, the distribution would be unlimited, and the more data points we considered, the wider the range would grow. In practice, as in this example, we never see unlimited distributions; we always see the range grow to a finite size that is governed by the underlying effects.

*We see that the random errors we found in the GR&R, GR&R+Bias and the
Stability studies are but a myth. All the variation is due to underlying systematic
effects. The only random occurrence is the outlier, where the operator misread the gage.*

As stated above, the operator influence is somewhat unrealistically modeled, but even when the usual random appearance is modeled more faithfully, it is still clear that the only thing that may be random is the variation of the operator's actions. The response of the measuring system and the workpiece to the operator's actions is fully systematic, e.g., the higher the measuring force, the larger the elastic deformation of the workpiece.

It is only as long as we do not understand these underlying systematic effects that the variation appears random to us. As soon as we understand the underlying effects, the random semblance disappears.

Taken in its purest form, the GR&R study tells us only how much the measuring results vary during the short period of time the study takes. The 11.3 µm GR&R value is not related to the level or nature of the errors that we are trying to analyze. The GR&R study does not capture the full extent of the variations we see. It is based only on data varying between +7 µm and +12 µm - a range of 5 µm.

Even when we enhance the GR&R study and investigate the bias of the measurement, we can only see what the process is doing at the particular time when we do our study.

We can also see that terms like operator error and instrument error become meaningless when the major error source comes from outside the measurement process, as in this case, where the ambient temperature is the main factor.

Looking at figure 8 we find that we would have gotten different results at different times during the day. If we had carried out our study for example from 11:00 to 15:00 instead of from 7:00 to 11:00, we would have seen a higher GR&R value, but a lower bias, so even the relationship between the two is not fixed.

The largest problem, however, is that the GR&R study does not help us understand the measuring process. It only provides us with a couple of numbers to characterize the process. Neither the Repeatability (Instrument Error), the Reproducibility (Operator Error) nor the Bias tells us what its name suggests. Furthermore, they do not help us understand the measuring process in such a way that we can improve its uncertainty or find lower cost ways of achieving the same uncertainty.

The stability study does better than the GR&R study in this example, because all the influences happen to go through their full cycle within the 24-hour period of the study. There is no guarantee that this will be the case when we set out to do the study, so we are not assured this benefit.

The study arbitrarily concludes that there is something wrong with two of the measuring points. As we see when we look at the full data set, one of the points is indeed an outlier, but the other is very much part of the underlying distribution.

The study uses a quite arbitrary distinction between what is considered systematic and what is considered random. While it describes the variations more completely than the GR&R study in this case, it still has the fundamental shortcoming that it is unable to tell us how to improve the measuring process.

A strength and a weakness of the GUM method is that it is based as much on theoretical analysis as on actual measurements.

It is a strength, because it allows us to include influences in our analysis which are hard or cost-prohibitive to determine experimentally, such as seasonal variations.

It is a weakness, because it requires the person who is doing the analysis to successfully identify at least all the major contributors to the uncertainty of a measuring process. If a contributor is not identified, then its influence is not reflected in the uncertainty budget and if it is a major contributor, it may invalidate the whole uncertainty budget. If, for example, in the study conducted above, we overlooked the influence from the gage temperature, we would find an expanded uncertainty of 3.2 µm, rather than the 14.4 µm we find when we include this influence. If, on the other hand, it was the operator influence we overlooked, then we would find an expanded uncertainty of 14.3 µm, because this influence is so much smaller than the dominating influence.
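The sensitivity to an overlooked contributor can be sketched by recomputing the budget with one component removed (component values from Table 2):

```python
import math

# Uncertainty components from Table 2 (µm)
components = {"gage": 7.0, "part": 1.4, "operator": 0.77, "rounding": 0.3}

def expanded_uncertainty(omit=None):
    """Expanded uncertainty (k=2), optionally leaving one contributor out."""
    total = sum(u * u for name, u in components.items() if name != omit)
    return 2.0 * math.sqrt(total)

full = expanded_uncertainty()               # ≈ 14.4 µm, the full budget
no_gage = expanded_uncertainty("gage")      # roughly 3.2 µm: the budget collapses
no_oper = expanded_uncertainty("operator")  # ≈ 14.3 µm: barely any change
```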

The major strength of the GUM method is, however, that it provides an understanding of the contributors that cause the uncertainty of the measurement process and of their relative magnitude. None of the other studies does that.

We find in our example that if we want to improve the measuring uncertainty, we either have to improve the control of the gage temperature, or we have to measure it and correct for it, since it is by far the largest contributor. Changing any of the other contributors will not have any appreciable effect until the gage temperature is under tighter control.

A limitation of the method is that it is based on assumed distributions. The estimate of the measuring uncertainty cannot be any better than these assumptions. Fortunately, the error we commit when assuming the wrong distribution generally changes the effect of the contributor by no more than 15-20 %, and it is possible to err on the safe side by always choosing the more conservative of the two distributions under consideration when in doubt. As we see, it is only for the largest contributor that it is critical to assume the correct distribution, since a 15-20 % change in the influence of any other contributor would be negligible.

An interesting observation about the GUM method is that the generally accepted value for the coverage factor k=2, which is designed to provide a confidence level of *no less than* 95 %, actually covers 100 % of the deviations in cases where:

- The influences of the contributors have been correctly estimated.
- The distribution of the major contributor is rectangular or U-shaped.

I suggest that this covers the majority of cases, at least in dimensional metrology, where temperature is the major contributor, which means that in these cases the GUM method overestimates the uncertainty.
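For a U-shaped (arcsine) distribution with variation limit a, the standard deviation is a/√2 ≈ 0.7a (the distribution factor of Table 2), so an expanded uncertainty of 2u ≈ 1.41a extends beyond the limits and covers every deviation. A quick Monte Carlo sketch (the sampling model, a sinewave read at random times, is an assumption):

```python
import math
import random

random.seed(0)
a = 10.0  # variation limit of the dominant contributor, µm

# A sinewave sampled at uniformly random phases has a U-shaped (arcsine)
# amplitude distribution - the model used for the temperature influences.
xs = [a * math.sin(random.uniform(0.0, 2.0 * math.pi)) for _ in range(100_000)]

sigma = math.sqrt(sum(x * x for x in xs) / len(xs))  # ≈ a / sqrt(2) ≈ 0.7 * a
covered = sum(abs(x) <= 2.0 * sigma for x in xs) / len(xs)
# covered == 1.0: the k=2 interval contains every single deviation
```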

Using an example which I propose is fairly typical of dimensional measuring processes, I show how different approaches to evaluating the goodness of the measuring process yield very different results and lead the examiner to draw very different conclusions about the measuring process.

I show that when a number of systematic effects are superimposed on one another and evaluated using statistical tools, the resulting variations can easily be interpreted as displaying random characteristics, passing various tests for normal distributions, when observed for a limited time.

Finally I show that when the underlying mechanisms causing the variation are understood, the apparently random variation loses its random appearance and can be shown to be systematic and governed by the laws of physics.

Thus I conclude that, unless our measurements approach the atomic level of resolution, where additional effects come into play that are beyond the scope of this paper, the variation we observe is systematic in nature and the concept of a random error is but a myth, fueled by our inadequate analysis of the root causes of the variations we see.

1. Guide to the Expression of Uncertainty in Measurement. BIPM, IEC, IFCC, ISO, IUPAC, IUPAP, OIML, 1995.

2. Measurement Systems Analysis Reference Manual, Automotive Industry Action Group, 1995.

3. ISO/TR 14253-2:1997 Geometrical Product Specifications (GPS) - Inspection by measurement of workpieces and measuring equipment - Part 2: Guide to the estimation of uncertainty in GPS measurement, in calibration of measuring equipment and in product verification.