Problems associated with the statistical reporting and analysis of ultratrace data
J. Bzik, Air Products and Chemicals
The application of standard tools such as control charting and
process capability indices to ultratrace data has not proven as beneficial
in the semiconductor industry as in other industries. Additionally,
nonstatistically based rules for accepting or rejecting product based
on ultratrace specifications have not tended to work as well
as past experience with nonultratrace processes has suggested.
This article describes the problems associated with the analysis and
use of ultratrace data. The reasons for the relatively persistent misapplication
of data analysis tools to such data and the costs involved are also
examined. Additionally, possible solutions, both statistical
and nonstatistical, are developed.
of Analyzing Ultratrace Data
Consumers and producers of ultrapure products are constantly seeking
materials that meet tight purity specifications. Such products command
a premium and are relatively more difficult to manufacture and measure
than less-pure materials. However, measurement systems often serve as
the limiting constraint. In typical product specification negotiations,
the focus is appropriately on manufacturing capability, while analytical
capability is often underemphasized or even ignored. Consequently, ultratrace
analytical methods are routinely used close to, at, or beyond their
capability. Overestimating ultratrace process capabilities leads to
the phenomenon known as statistical ratchet: When process capability
is overestimated, a lower impurity specification is often requested,
which leads to overly tight specifications.
Unrelenting pressure on specifications without an understanding of measurement-system
limitations causes predictable problems. Substantial business costs
accrue from the products for which unrealistically tight specifications
have been set. Some of these incurred costs have been beneficial; for
example, the added costs resulting from the limitations of analytical
measurement systems have driven the long-term progress in ultratrace
measurement capability. When sufficiently reliable, newly available
ultratrace data can help guide process improvement programs. However,
ongoing improvements in manufacturing or analytical capabilities are
typically swamped by the relentless pressure for better specifications,
allowing problems to recur.
In practice, ultratrace data are routinely misinterpreted in production
situations. There is a strong systematic bias to conclude wrongly that
statistical control has been lost. False indications of quality issues
typically exceed those involving material acceptability. Producers of
ultrapure materials conduct numerous root-cause analyses that in fact
amount to searches for nonexistent problems. When
a root-cause analysis identifies a nonexistent problem, further resources
are wasted trying to fix it. Producers may be pressured into modifying
processes because of customer misperceptions that statistical control
is frequently lost. Much effort is consumed by the internal and external
debates that originate with quality misperceptions and the subsequent
misguided efforts at fixing process or measurement protocols.
Materials falsely classified as out of specification are part of a producer's
cost, but producer quality control costs ultimately lead to higher material
costs for the customer. Other consumer quality control costs include
the expenses incurred for remeasurements of product, quality audits,
issue documentation and resolution, lost production time resulting from
waiting for "good" materials, production risk assessment of "marginal
material" use, and implementing quality improvement initiatives. These
are only some of the costs associated with having a high false-alarm
rate, which creates an atmosphere of suspicion. Business pressures and
the pervasive use of inappropriate statistical methods fuel these costs.
Problems related to the statistical analysis of ultratrace data originate
from the perception that there is too much measurement uncertainty and
from the traditional response to that perception: data censoring. Statistical
quality control/statistical process control (SQC/SPC) practices compound
the situation by routinely treating censored data as real data for statistical
analysis, not appropriately accounting for sample size in control-limit
estimations, ignoring the increased testing risks that arise from the
use of multiple statistical tests, setting testing risk on a per-measurement
basis, using flawed distributional assumptions, and underestimating
A Shewhart control chart for individual points, such as the one shown
in Figure 1a, is often used in ultratrace SQC/SPC analysis. Control
limits (represented by the dashed lines) are calculated as either
$\bar{x} \pm 2.66\,\overline{MR}$ or $\bar{x} \pm 3s$ (from in-control data, where $\bar{x}$ = process average,
$\overline{MR}$ = average absolute range between neighboring points, and s =
square root of the pooled variance between neighboring points). Either calculation
provides a nominal 3σ-equivalent control chart [1,2]. The
nominal risk of finding an out-of-control data point for an in-control
process is 0.27% (under the implicit data-normality assumption). However,
the true level of risk depends on the sample size and increases rapidly
with decreased charting data. Consequently, it is advisable to use at
least 25–30 points in chart construction.
Figure 1: Examples of control charts used in SQC/SPC analysis: (a) a Shewhart
individual control chart, and (b) a moving-range chart. Both indicate
upper and lower control limits (dashed rules). In addition, chart
1a indicates the process average and chart 1b shows the average
absolute range between neighboring points (solid rules).
Not correcting for sample-size risk is a common failing. The use of 30 points
in control-limit chart construction provides a 0.77% chance that the
next in-control measurement will be determined to be out of control
(about three times the perceived nominal risk). When a process is in
control and normally distributed, there is approximately a 7% risk that
a chart will indicate a loss of statistical control within the 30 points
used to build the chart and about a 17% risk that there will be a loss-of-control
result in the next 30 points tested. Typical distributional characteristics
of ultratrace data often exacerbate these risks, which are compounded
when multiple parameters are being charted.
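
The risks quoted above can be explored with a small Monte Carlo simulation. The following is a minimal sketch, not the author's calculation: it assumes standard normal in-control data, limits of the form mean ± k × (average moving range / 1.128), and hypothetical function names (`individuals_limits`, `false_alarm_rates`). It estimates the chance that the next point, any training point, or any of the next 30 points falls outside limits built from 30 points.

```python
import random
import statistics


def individuals_limits(data, k=3.0):
    """Individuals-chart limits: mean +/- k * (average moving range / 1.128)."""
    center = statistics.fmean(data)
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    sigma_hat = statistics.fmean(moving_ranges) / 1.128  # 1.128 = d2 for ranges of two
    return center - k * sigma_hat, center + k * sigma_hat


def false_alarm_rates(n_train=30, n_future=30, k=3.0, trials=10000, seed=1):
    """Estimate three false-alarm risks for an in-control, normal process.

    Returns (next_point, within_training, within_future) as fractions of trials.
    """
    rng = random.Random(seed)
    next_point = within_training = within_future = 0
    for _ in range(trials):
        train = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
        future = [rng.gauss(0.0, 1.0) for _ in range(n_future)]
        lcl, ucl = individuals_limits(train, k)
        next_point += not (lcl <= future[0] <= ucl)
        within_training += any(x < lcl or x > ucl for x in train)
        within_future += any(x < lcl or x > ucl for x in future)
    return next_point / trials, within_training / trials, within_future / trials


if __name__ == "__main__":
    for k in (3.0, 3.47):
        nxt, trn, fut = false_alarm_rates(k=k)
        print(f"k={k}: next point {nxt:.2%}, training set {trn:.2%}, next 30 {fut:.2%}")
```

Because the estimates are simulated, they will only approximate the percentages quoted in the text, and they depend on the normality assumption built into the sketch.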
Risk is traditionally controlled at too low a level (single-property
measurement) while testing frequency is ignored. If 30 points are used
in constructing Shewhart individual control charts, the wider in-control
ranges represented by ±3.47s should be used to achieve the true normal 3σ risk
level of 0.27%. With that change, the 7 and 17% risks described above
will be reduced to 1.6 and 6.3%, respectively. However, such a change
is only a partial fix because quality control risks must be controlled
at the product level rather than at the product-property measurement level.
The risk of finding the next in-control point out of statistical control
when n = 10 or n = 20 training points are used
in control-limit chart construction with ±3s limits
is 2.4 or 1.1%, respectively. Sequential control-limit recalculation
with large-sample-size rules will not resolve this issue; if limits
are simply recalculated starting at n = 10 until n
= 30, then 24.3% of the in-control sequences will be misidentified as
out of statistical control. Control charting with limited data requires
the use of appropriate statistical multipliers.
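
The effect of sequential recalculation can be sketched by simulation. The exact protocol behind the 24.3% figure is not spelled out in the text, so the sketch below makes an assumption: limits are recomputed from the first m points (m = 10 to 29) and only the next point is tested against them. Data are assumed standard normal and in control; the function names are hypothetical.

```python
import random
import statistics


def mr_limits(data, k=3.0):
    """Limits from the mean and the average moving range (sigma = MR-bar / 1.128)."""
    center = statistics.fmean(data)
    mr_bar = statistics.fmean([abs(b - a) for a, b in zip(data, data[1:])])
    half_width = k * mr_bar / 1.128
    return center - half_width, center + half_width


def sequential_false_alarm(n_start=10, n_end=30, k=3.0, trials=4000, seed=2):
    """Fraction of in-control sequences flagged at least once while limits are
    recalculated after every new point from n_start up to n_end."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(trials):
        seq = [rng.gauss(0.0, 1.0) for _ in range(n_end)]
        for m in range(n_start, n_end):
            lcl, ucl = mr_limits(seq[:m], k)
            if not lcl <= seq[m] <= ucl:  # test the (m+1)th point against limits from m points
                flagged += 1
                break
    return flagged / trials


if __name__ == "__main__":
    print(f"k=3.0: {sequential_false_alarm():.1%} of in-control sequences flagged")
    print(f"k=4.0: {sequential_false_alarm(k=4.0):.1%} of in-control sequences flagged")
```

Raising `k` shows how a wider multiplier trades sensitivity for fewer false alarms when the training set is small; the value 4.0 is only an illustrative choice, not a recommended multiplier.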
Control charts are sometimes used in conjunction with the Western Electric rules
(tests for special causes). These rules focus on the identification
of data patterns that indicate a systematic loss of control. For example,
one version of the "run rule" says that if eight contiguous
points are consistently above or consistently below the average, the
process is not in control. If 30 in-control points defined the average,
then application of the run rule for 30 additional in-control points
would provide a 13.1% chance that the process will be found out of control.
Other Western Electric rules are typically applied in a systematically
uncontrolled manner and are sometimes statistically inefficient. Statistical
methodologies that control all statistical risks simultaneously
and more-efficient tests are needed to replace the Western Electric rules.
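
The run-rule risk described above can also be checked by simulation. This is a rough sketch under stated assumptions: standard normal in-control data, a centerline estimated from 30 training points, and a run defined as eight consecutive points strictly on the same side of that centerline. The helper names are hypothetical.

```python
import random
import statistics


def has_run(values, center, length=8):
    """True if `length` consecutive values fall strictly on one side of center."""
    run = 0
    last_side = 0
    for v in values:
        side = 1 if v > center else -1 if v < center else 0
        if side != 0 and side == last_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        last_side = side
        if run >= length:
            return True
    return False


def run_rule_false_alarm(n_train=30, n_future=30, length=8, trials=10000, seed=3):
    """Fraction of in-control sequences in which the run rule signals."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        center = statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n_train)])
        future = [rng.gauss(0.0, 1.0) for _ in range(n_future)]
        hits += has_run(future, center, length)
    return hits / trials


if __name__ == "__main__":
    print(f"Run-rule false-alarm rate: {run_rule_false_alarm():.1%}")
```

The estimate should land in the neighborhood of the figure quoted in the text; part of the risk comes from the centerline itself being an estimate from only 30 points.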
The moving-range chart (Figure 1b) is often used for process control when
point-to-point variation is unusually large. If such a chart is used
in conjunction with a Shewhart-type chart, its risks also need systematic
integration. A calculated ultratrace lower control limit (LCL) is effectively
irrelevant when it is at or below the method detection limit (MDL).
Additionally, some practitioners ignore the LCL, since too little impurity
is perceived as better than too much. In such cases, the risk of
a false out-of-control finding is one-half of that previously stated.
Because the purpose of using control charting is to achieve product
consistency, the use of above-MDL LCLs is an important consideration.
Products often must meet multiple ultratrace-impurity specifications,
which compounds the aforementioned statistical problems. Analysis of
30 trace metals, for example, results in 30 simultaneous sets of data.
If each trace-metal analysis is based on using a 30-point in-control
training set to calculate a nominal 3σ-equivalent control chart, there
is a 90% risk that one or more of the 30 charts will show that the process
violates its control limits. If these limits are applied to the next
in-control product tested, there will be a 20.7% risk that one or more
of the control limits will be found to have been exceeded. When there
is such a repeated high risk of obtaining out-of-control statistical
results from in-control processes, statistical mayhem—with all
its business consequences—will result.
The more product properties are specified, the worse the situation becomes.
Smaller training data sets present greater risks. Even when 30-point
in-control training sets and the recommended ±3.47s are used to calculate wider control limits to obtain
a true 3σ risk level, there is a 38.4% risk that one or more of the
charts for 30 trace metals will show that an in-control process violates
its control limits. When these limits are applied to the next in-control
product tested, there will be a 7.5% risk that one or more control limits
will appear to have been exceeded. Controlling each chart's risk level
is helpful, but simultaneous control of risks for all of the charts
is required. Such control is not current industry practice.
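
The multiple-chart risks above follow from compounding the per-chart risk across charts. A minimal sketch, assuming the 30 charts behave independently (an assumption made here for illustration; real impurity measurements may be correlated) and using the rounded per-chart risks quoted earlier as inputs:

```python
def familywise_risk(per_chart_risk, n_charts):
    """Chance that at least one of n independent charts gives a false alarm."""
    return 1.0 - (1.0 - per_chart_risk) ** n_charts


# Per-chart risks as quoted in the text (rounded values).
cases = [
    ("3s limits, within the 30 training points", 0.07),
    ("3s limits, next product tested", 0.0077),
    ("3.47s limits, within the 30 training points", 0.016),
    ("3.47s limits, next product tested", 0.0027),
]

for label, risk in cases:
    print(f"{label}: {familywise_risk(risk, 30):.1%}")
```

Because the inputs are rounded, the outputs agree with the percentages in the text only to within rounding (for example, a 7% per-chart risk compounds to roughly 89%).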
In addition, although Shewhart individual control chart performance degrades
when in-control training data are not normally distributed, the normality
assumption is rarely checked in practice. Some ultratrace data are anticipated
to be nonnormal in distribution. The c-chart (assumed Poisson distribution),
while often suggested for use with particle count data, is a flawed
model for most ultratrace SQC/SPC. A single-parameter Poisson distribution
incorporates only sampling error into the control limits. It does not
effectively incorporate random process or random measurement variation.
The effect, once again, is a greater-than-nominal risk of indicating
that an in-control process is out of control.
The perception that there is too much measurement uncertainty and the ways
that uncertainty is handled are critical to the successful statistical
analysis of ultratrace data. If an ultratrace purity specification were
100 ppt and the measurement standard deviation were 50 ppt, that 50%
relative standard deviation would typically be considered unacceptably
large and the measured results not very trustworthy. The four common
responses to this situation are data censoring (using rounding, physical
limits, or detection limits), signal averaging, improving the measurement
method, and leveraging data characteristics (e.g., image enhancement).
Of these four techniques, data censoring is perhaps the most problematic,
because it actively de-enhances data resolution and damages subsequent
statistical analysis method reliability.
Although statisticians are aware that the combined use of rounding, physical
limits, and detection limits is problematic, they are often hesitant
to question standard data-handling practices. Consequently, most statistical
repair efforts focus on simple substitutions, nonnormal distributions,
or nonparametric methods. These efforts have had limited success, since
damage caused by data censoring can, at most, be only partially undone
by subsequent manipulations. The optimal time to apply advanced statistical
routines is after everything possible has been done to clean up the data.
Data perceived to be too imprecise might be subjected to "comfort"
rounding so that measurement reliability is not overrepresented. Practitioners
of this method generally do what feels right based on the situation.
For example, data values might be rounded (censored) to the nearest
0.1 ppb if the standard deviation is 0.06 ppb. Rounding to significant
figures or digits is a more formalized technique than comfort rounding,
but its rules are subject to interpretation. Published significant-figure/digit
rules differ, such as the approximations provided for propagation of
error. Issues such as which uncertainty sources to include and how to
calculate the uncertainties to be propagated are typically undefined.
Intended to communicate both a measured value and an implicit measure of its
uncertainty via a single result, the use of significant figures or digits
always reduces the reliability of estimated statistics; the degree of
damage to SQC/SPC statistics is determined by the relative aggressiveness
of the rounding. In the case of an in-control process with high relative
measurement uncertainty, for example, very few unique values may be
left after rounding. Indeed, having very few different measured values
is the typical data signature of overaggressive rounding. The subsequent
use of a continuous distribution methodology, such as a Shewhart control
chart, on such artificially overly discrete data will be problematic.
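
The loss of resolution from rounding is easy to demonstrate. The sketch below uses the rounding example given above (standard deviation 0.06 ppb, rounding to the nearest 0.1 ppb); the process mean of 0.50 ppb, the normal distribution, and the sample size of 30 are assumptions chosen only for illustration.

```python
import random


def round_to_step(values, step):
    """Round each value to the nearest multiple of `step`."""
    return [round(round(v / step) * step, 10) for v in values]


rng = random.Random(4)
# Hypothetical in-control data: mean 0.50 ppb, standard deviation 0.06 ppb.
raw = [rng.gauss(0.50, 0.06) for _ in range(30)]
rounded = round_to_step(raw, 0.1)

print("unique raw values:    ", len(set(raw)))
print("unique rounded values:", len(set(rounded)))
print("rounded levels:       ", sorted(set(rounded)))
```

With only a handful of distinct levels left, many moving ranges collapse to zero or to a single rounding step, which is the kind of artificially discrete data that a continuous-distribution chart handles poorly.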
Physical Limits. Two of the physical limits used in the analysis of
ultratrace data are zero (indicating that a specific measured contaminant
is not present) and 100% (indicating the highest assay level). However,
measurement systems sometimes indicate contamination levels of less
than zero or an assay level higher than 100%. The causes of such unachievable
results include measurement error, flawed calibration, or the subtraction
of one result from another, each of which is subject to measurement
error. Traditionally, such results are censored to physical limits.
While that form of censoring seemingly improves the data, it actually
underestimates uncertainty and over- or underestimates process averages.
Figure 2 illustrates the effects of adopting a physical limit of zero.
Censoring caused the misrepresentation of the data distribution's shape
(in the graph at left), overestimation of the process average (0.12
censored versus 0.00 uncensored), and underestimation of the standard
deviation (0.17 censored versus 0.31 uncensored). Such physical-limit
censoring can have a significant effect on control charting and may
obscure calibration and other measurement system issues as well.
Figure 2: Graphs showing the impact of applying a physical limit to ultratrace
measurement data. A physical limit of zero was used in creating
the graph at left, while the graph at right depicts uncensored data.
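
The direction and rough size of these effects can be reproduced with a simple model. The sketch below assumes, purely for illustration, that the uncensored measurements are normally distributed with mean 0.00 and standard deviation 0.31 (the uncensored values reported for Figure 2); the underlying data are not available, so the simulated sample is hypothetical.

```python
import math
import random
import statistics


def censor_at_zero(values):
    """Apply a physical limit of zero: negative results are reported as zero."""
    return [max(v, 0.0) for v in values]


def censored_normal_moments(sigma):
    """Mean and standard deviation of max(X, 0) for X ~ Normal(0, sigma^2)."""
    mean = sigma / math.sqrt(2.0 * math.pi)
    sd = sigma * math.sqrt(0.5 - 1.0 / (2.0 * math.pi))
    return mean, sd


rng = random.Random(5)
raw = [rng.gauss(0.0, 0.31) for _ in range(5000)]  # hypothetical uncensored data
censored = censor_at_zero(raw)

print(f"uncensored: mean={statistics.fmean(raw):.2f} sd={statistics.stdev(raw):.2f}")
print(f"censored:   mean={statistics.fmean(censored):.2f} sd={statistics.stdev(censored):.2f}")
print("theory (censored): mean=%.3f sd=%.3f" % censored_normal_moments(0.31))
```

Under this model the censored average is inflated to about 0.12 and the censored standard deviation shrinks to about 0.18, close to the censored values reported for Figure 2.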
Detection Limits. A detection limit is a figure of merit for a detector
or measurement instrument or process. Results below a specified detection
limit are considered too unreliable for use in most statistical or practical
applications. Although it has generally ceased, an older, questionable
practice was to record zero or "nondetectable" for below-detection-limit
measurements without explicitly stating what the limit was. This practice
created an incentive for producers of high-purity products to either
use less-capable measurement methods than were then available or to
claim artificially high detection limits. The misleading data that resulted
led to the development of standards for using detection limits with
ultratrace data. A current SEMI standard requires that measurement methods
have an MDL at or below the specification being measured [5,6].
(Other information on determining detection limits is also available
from SEMI and elsewhere [7,8].)
There are many definitions of the term detection limit. Variants
include order-of-magnitude differences [9–16]. Many acronyms
for MDL (e.g., DL, LOD, LLD, LLOD, LOGD, and LOQ) are also used. However,
the following discussion follows SEMI's use of MDL, which includes all
major sources of measurement method variability.
MDLs are useful as measurement system figures of merit relative to specifications,
but they are problematic when used to censor measurement data. Although
that practice is not traceable to the SEMI MDL definition, it seems
so intuitively logical that it goes unquestioned. Such censored data
are either shown as the value of the MDL or as "<MDL." The latter
form makes the censoring more apparent. During statistical analysis,
<MDL data are typically treated as the MDL or as one-half the MDL.
Either version leads to subsequent statistical mayhem.
Data censoring using the MDL also affects statisticians' ability to utilize
the process capability measures Cp and Cpk (Cp
= the number of 6σ intervals fitting into the specification interval
and Cpk = the number of 3σ intervals fitting between
the process average and nearest specification). Cp can only
be defined when there are both upper and lower specifications. Therefore,
in ultratrace data situations where the MDL is treated as the lower
control limit, only Cpk is applicable. However,
determining Cpk for ultratrace data can lead to a potentially
gross overestimation of process capability, as the following example shows.
Figure 3: Control charts showing the effects of replacing below-MDL data
with the MDL for statistical analysis: (a) a Shewhart chart, and
(b) a moving-range chart.
Table I presents actual measurements for 26 data points and the comparable
results when a detection limit of 30 has been applied. Based on these
data, Figures 3–5 graphically show the effects of different types
of data substitutions (detection limit, one-half detection limit, and
zero) on Shewhart and moving-range control charts, while Figure 6 shows
charts for the uncensored data. The SQC summary statistics for the four
data sets are given in Table II. As this table indicates, averages ranged
from 2.4 to 30.1 depending on the type of substitution. Only the average
for the one-half-MDL data was relatively close to that for the uncensored
data for these particular ultratrace data points. The standard deviation
and the upper control limit of the censored data were universally lower
than those of the uncensored data, and some substitutions even resulted
in an upper control limit that was less than the MDL. In addition, when
three hypothetical specifications were evaluated, process capability
when based on censored data was judged to be higher than when based
on uncensored data, as shown in Table III.
Figure 4: Control charts showing the effects of replacing below-MDL data
with one-half the MDL for statistical analysis: (a) a Shewhart chart,
and (b) a moving-range chart.
Figure 5: Control charts showing the effects of replacing below-MDL data
with zero for statistical analysis: (a) a Shewhart chart, and (b)
a moving-range chart.
Figure 6: Control charts for uncensored below-MDL data: (a) a Shewhart
chart, and (b) a moving-range chart.
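
The substitution effects can be illustrated with a short calculation. The sketch below does not use the Table I data, which are not reproduced here; the 26 values are hypothetical and were chosen only so that most fall below an MDL of 30. It assumes a short-term sigma estimated from the average moving range (MR-bar / 1.128), an upper control limit of mean + 3 sigma, and a one-sided Cpk of (spec − mean) / (3 sigma) with a hypothetical specification of 50.

```python
import statistics

# Hypothetical measurements for illustration only (NOT the article's Table I data).
RAW = [12, 25, 18, 31, 9, 22, 27, 15, 20, 33, 6, 19, 24,
       14, 28, 17, 21, 11, 26, 16, 23, 8, 29, 13, 20, 18]
MDL = 30.0    # detection limit used for censoring
SPEC = 50.0   # hypothetical upper specification


def substitute(values, mdl, fill):
    """Replace every below-MDL value with a single fill value."""
    return [float(v) if v >= mdl else fill for v in values]


def summarize(values, spec):
    """Mean, short-term sigma (MR-bar / 1.128), upper control limit, and Cpk."""
    mean = statistics.fmean(values)
    mr_bar = statistics.fmean([abs(b - a) for a, b in zip(values, values[1:])])
    sigma = mr_bar / 1.128
    cpk = (spec - mean) / (3 * sigma) if sigma > 0 else float("inf")
    return {"mean": mean, "sigma": sigma, "ucl": mean + 3 * sigma, "cpk": cpk}


scenarios = {
    "uncensored": [float(v) for v in RAW],
    "MDL": substitute(RAW, MDL, MDL),
    "half MDL": substitute(RAW, MDL, MDL / 2),
    "zero": substitute(RAW, MDL, 0.0),
}
results = {name: summarize(vals, SPEC) for name, vals in scenarios.items()}

for name, r in results.items():
    print(f"{name:>10}: mean={r['mean']:.1f} sigma={r['sigma']:.2f} "
          f"UCL={r['ucl']:.1f} Cpk={r['cpk']:.2f}")
```

For these hypothetical values, every substitution shrinks the estimated sigma and the upper control limit and inflates Cpk relative to the uncensored data, and the half-MDL and zero substitutions push the upper control limit below the MDL itself.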
The uncensored data in this example have a relative standard deviation of
46%. While each uncensored measurement in the collection of 26 data
points is relatively unreliable, the results from these points can be
reasonably statistically analyzed. In contrast, the use of a single-number
MDL substitution in statistical analysis always results in biased averages
(both underestimations and overestimations), biased standard deviations
(typically underestimations), biased upper control limits (typically
underestimations), and a biased Cpk (typically overestimations).
Once data have been MDL censored, there is no graceful recovery. The
use of advanced statistical methodologies for censored data (e.g., maximum
likelihood estimation or correction factor tables) can only recover
some fraction of the original statistical information [17].
Out of Control
Table II: Comparative statistical analysis scenarios using data from Table
I and three types of data substitutions.
The MDL is the most common substitution made for the purpose of statistical
analysis. Its use often results in over- or underestimations, as indicated
in Tables II and III. The Cpk of 39 (spec = 50) in Table
III misleadingly suggests that the specification for a process that
has a true Cpk of 1.1 can be tightened. And in Table II,
the standard deviation of 0.2 for the censored data is much lower than
that for the uncensored data (9.2). In conjunction with insufficient
statistical controls for small sample sizes and multiple-property testing,
such flawed analyses lead to predictable misinterpretations in ultratrace
data analysis. The failure to grasp the nature and scope of these routine
misinterpretations has proven costly to both producers and consumers
of ultratrace materials.
Table III: Comparative Cpk statistical analysis scenarios using data from
Table I, three types of data substitutions, and three hypothetical
specifications (Spec = 50, Spec = 40, and Spec = 35).
Problems related to MDL data censoring are exacerbated by another traditional
practice: MDL rounding. Many reported MDLs are rounded up, sometimes
aggressively. For example, an MDL estimated at 63 ppt is likely to be
reported as 100 ppt. High MDLs that are rarely exceeded are sometimes
used as a crude technique for masking detection limit–induced
Another common statistical practice involves using the standard deviation of
all the data in control-chart construction rather than basing limits
on a short-term estimate of variability. This error allows process variability
(if any) to widen the control limits and is not a good mechanism for
offsetting the risks incurred by employing data-censoring practices.
In other cases, ultratrace specifications may be set lower than what is
known about process or measurement variability and the MDL. Possible
fixes for this problem include requiring that specifications be no lower
than an MDL multiple (e.g., 2, 3, 5, or 10) and requiring a specific
gauge capability (e.g., a measurement procedure variation of <10%
of the product variation or, in the absolute worst case, <30% of
the product variation). However, such practices do not resolve all problems
that arise in handling censored data.
Another issue is what happens when a specification and a high MDL that is not
biased are equal. Very substantial problems result, and a solution would
require additional fundamental changes in how such relatively tenuous-quality
data are interpreted. To avoid having to use different protocols to
interpret such data safely, specifications should be set to an absolute
minimum of twice the relevant MDL.
Although quantifiable results are available for most ultratrace data
currently reported as being below the MDL, it should be remembered that
some detection limits are real. For example, it may be impossible to
discern a peak because of instrument white noise. Advanced statistical
methods are required for ultratrace data having a mix of quantifiable
measurements and true nondetects.
Issues and Solutions
Statistical methodologies used to analyze ultratrace data must function
robustly in a nonideal environment. One way to address that challenge
is the use of a simulation methodology that allows many statistical
issues to be anticipated and planned for by determining the likely impact
of an SQC/SPC strategy. Simulation requires the use of data distributions
that are similar to real distributions and the use of all applicable
In some circumstances, ultratrace data can be in control and nonnormally
distributed. When the upper tail of the distribution is more dispersed
than the lower tail, the upper control limit of a Shewhart control chart
may seem to have been violated. Such high nonrepresentative process
measurements can result when outside contaminants have infiltrated the
sampling or measurement environment. A mixture of distributions, often
asymmetric, will result, perhaps with outliers in the upper tail. Therefore,
it is premature to fit a nonnormal distribution until it has been verified,
by remeasurement and resampling, that the nonnormality is inherent rather
than an artifact of spurious contamination sources. If the offending
sampling or measurement system is difficult to fix, quality-decision
rules must be designed to compensate for what is known about the system's
Some in the electronics industry have been attempting to define statistical
specifications (e.g., only in-control product is automatically acceptable).
In light of the many issues discussed in this article and the numerous
underlying and unaddressed SQC standardization issues (e.g., different
companies requesting different statistical practices), these initial
efforts are premature.
Practitioners have a strong systematic bias to falsely conclude that
process control has been lost, to overestimate process capability, and—when
a specification and MDL are too close—to misclassify product quality.
Producers and consumers of ultrapure materials incur substantial business
costs as a result of such flawed data handling and analysis. Much effort
is wasted attempting to fix process problems that are in reality predictable
Addressing this challenge requires major changes in the treatment and handling
of ultratrace data. Data should not be censored, but if they must be,
rational bounds should be built into the SQC/SPC procedures to provide
a margin of error. Detection limits should be treated only as figures
of merit for measurement processes. Specifications should not be set
too close to a detection limit, nor should physical limits be applied.
Ultratrace data should not be rounded aggressively, and all statistical
risks in SQC/SPC testing should be evaluated and balanced simultaneously.
To account for small sample sizes, statistical multipliers should be
modified. Advanced statistical methods should be applied only to data
that have not been censored needlessly.
Such changes will be strongly opposed. Fear of data overinterpretation has
led the industry to systematically misinterpret ultratrace data while
simultaneously corrupting statistical tools and surrendering vital information
on process states. Customer education and the development of robust,
standardized ultratrace statistical and data-handling methodology will
be required. The Statistical Methods Task Force in SEMI Standards Process
Chemicals is working to develop standard practices that are relevant
to the analysis of ultratrace data and the development of statistical
specifications. Although serious effort will be required to overturn
accepted, but inappropriate, practices, the rewards will be worth it.
References
1. DC Montgomery, Introduction to Statistical Quality Control,
2nd ed. (New York: Wiley, 1991).
2. EL Grant and RS Leavenworth, Statistical Quality Control, 5th
ed. (New York: McGraw-Hill, 1980).
3. TJ Bzik and SN Kamat, "Control Charting with Limited Data,"
in Proceedings of the Section on Quality and Productivity—American
Statistical Association (Alexandria, VA: American Statistical Association,
4. TJ Bzik and SN Kamat, "Benchmarking of a New Short Run Control
Charting Methodology" (paper presented at the American Statistical
Association Annual Technical Meeting, Chicago, August 7, 1996).
5. TJ Bzik, "Statistical Reporting and Analysis of Ultratrace
Measurement Data: Issues and Solutions" (paper presented at Semicon
West, San Francisco, July 22–24, 2002).
6. SEMI C1—Specifications for Reagents—Method Validation
(San Jose: SEMI, 2003).
7. SEMI C10-0305—Guide for Determination of Method Detection
Limits (San Jose: SEMI, 2005).
8. Air Products and Chemicals, MDL Estimator Software; available
from Internet: www.airproducts.com/products/specialtygases/northamerica/download.htm.
9. PC Meier and RE Zund, Statistical Methods in Analytical Chemistry
(New York: Wiley, 1993).
10. TJ Bzik, "Method Detection Limit Estimation Theory and Practice,"
in Proceedings of the Annual Technical Meeting of the Institute
of Environmental Sciences and Technology (Mount Prospect, IL: IEST,
11. "Limit of Detection," chap. 8 in Specialty Gas
Analysis: A Practical Guidebook, ed. J Hogan (New York: Wiley,
12. SN Ketkar and TJ Bzik, "Calibration of Analytical Instruments—Impact
of Nonconstant Variance in Calibration Data," Journal of Analytical
Chemistry 72, no. 19 (2000): 4762–4765.
13. DW McCormack Jr., "Analysis with Data Beyond the Detection
Limit" (paper presented at the Semicon West Statistical Methods
Workshop, San Francisco, July 13, 2004).
14. TJ Bzik, "Method Detection Limits and the Statistical Treatment
of Censored Trace Data" (paper presented at the Semicon Europa
Analytical Methods Workshop, Munich, April 13, 2005).
15. ISO 11843-1, Capability of Detection, Part 1: Terms and Definitions
(Geneva: International Organization for Standardization, 1997).
16. ISO 11843-2, Capability of Detection, Part 2: Methodology
in the Linear Model Case (Geneva: International Organization for
17. TJ Bzik, "The Statistical Treatment of Censored Trace Data"
(paper presented at the Semicon West Statistical Methods Workshop, San
Francisco, July 13, 2004).
J. Bzik is a research associate in statistical sciences.
From 1980 to the present, he has worked at Air Products and Chemicals
(Allentown, PA), and from 1978 to 1979, he was at the Center for the
Environment and Man. His areas of expertise include limits of detection,
calibration, SQC, experimental design, regression analysis, and environmental
statistics. Bzik has chaired the task force devoted to the development
of SEMI standards for method detection limits, precision reporting,
and method validation in statistical methods and the task force in the
area of method validation in the process chemicals division. He has
also contributed to the statistical components of FED-STD-209D, FED-STD-209E,
and their subsequent international successor, ISO/TC 209. Bzik has worked
with International Sematech, has contributed to the International Technology
Roadmap for Semiconductors, and has also produced more than 60 publications.
In 1975 he received a BS in mathematics from Kutztown University in
Kutztown, PA, and in 1977 he received an MS in statistics from the University
of Connecticut in Storrs. (Bzik can be reached at 610/481-6650 or firstname.lastname@example.org.)