Quality control (QC) and quality assurance (QA) are important functions in most businesses, from manufacturing to software development. In practice this often means that one or more people meticulously inspect what's coming out of the factory, looking for imperfections and validating that the requirements for the products and services produced are satisfied. QC and QA are frequently performed by hand by a select few specialists, and determining whether quality is acceptable can be complex and error-prone.
This is a post about automating quality assurance using statistics and R.
What is statistical quality control?
Statistical quality control is a quantitative approach to monitoring and controlling a process. The best way to explain it is through an example.
Say you're the manager at a factory that manufactures lug nuts, and suppose your 10 mm lug nuts function within a 10 percent margin of error (i.e. customers tolerate roughly +/- 1 mm of error in length). As long as you're producing lug nuts measuring between 9 and 11 mm in length, you'd consider your machine to be functioning as designed.
How would you know if your machine has suffered a malfunction? A 9.7 mm lug nut could be a sign that your machine is producing lug nuts that are too small, or it could just be the natural error you'd expect from a machine that's supposed to make 10 mm lug nuts.
Take a look at the plots below. Can you tell which one has experienced a change in the mean?
Framing the Problem
As a smart manager, you're using statistical quality control to identify issues with your machine. You can think of each lug nut as an observation. Since we're trying to make 10 mm lug nuts, we'll assume that the mean lug nut length is 10 mm; over time, the average length of the lug nuts produced should approach 10 mm. We're also going to assume that our machine's errors are normally distributed, which means that lug nuts are much more likely to be close to 10 mm than far away from it.
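To make this concrete, here's a quick sketch (ours, not part of the original analysis) of what data from a healthy machine and a drifting machine might look like. The sd = 0.33 value is an assumption, chosen so that 3 standard deviations roughly matches the +/- 1 mm tolerance:

```r
set.seed(42)

# A healthy machine: lengths normally distributed around the 10 mm target.
healthy <- rnorm(100, mean = 10, sd = 0.33)

# A drifting machine: the last 10 lug nuts shift to an 11 mm mean.
broken <- c(rnorm(90, mean = 10, sd = 0.33),
            rnorm(10, mean = 11, sd = 0.33))

mean(healthy)  # should be close to 10
```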
So we've come up with a good framework for our problem - now what? Enter the qcc package in R. This magical little library was built by Luca Scrucca for nothing but statistical quality control, and it's extremely easy to use. You provide it with data and it tells you which points are considered outliers based on the Shewhart rules. It even color-codes them based on how irregular each point is. In the example below, you can see that for the last 10 points of the 2nd dataset I shifted the mean of the data from 10 to 11.
You can also define a training/test set from within qcc. Simply add the data you want to calibrate it with as the first parameter, then pass your test data via the newdata parameter (see code above and plot below).
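A minimal sketch of that calibration/test workflow might look like the following. The simulated data and the "xbar.one" chart type (for individual measurements) are our choices, not prescribed here:

```r
library(qcc)

set.seed(42)
calibration <- rnorm(90, mean = 10, sd = 0.33)  # data from the healthy machine
test        <- rnorm(10, mean = 11, sd = 0.33)  # 10 new points with a shifted mean

# "xbar.one" treats each lug nut as an individual observation;
# points outside the control limits are flagged and colored on the chart.
q <- qcc(calibration, type = "xbar.one", newdata = test)
summary(q)
```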
Some processes might not have normally distributed errors, but I've found that there are often ways to transform your error term so it behaves normally. It all just depends on how creative you are.
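As one hypothetical example of such a transformation: a process with multiplicative error produces lognormal measurements, and taking logs makes them look normal again:

```r
set.seed(1)

# Measurements with multiplicative error are lognormal, not normal...
x <- rlnorm(500, meanlog = log(10), sdlog = 0.25)

# ...but their logarithm is normal, so control charts apply to log(x).
shapiro.test(x)$p.value       # small: x itself fails a normality test
shapiro.test(log(x))$p.value  # larger: log(x) looks normal
```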
Building Your Own Quality Control Charts
As great as qcc is, it doesn't include my favorite type of statistical quality control - the Western Electric Rules (WER). The WER were first used by (you guessed it) the Western Electric Company as a way to standardize how their employees monitored their electric lines. While the Western Electric Co. isn't around anymore, the rules they came up with are still really useful for monitoring processes. In a minute we'll show you how to implement them yourself, but first let's explain how they work...
The WER are remarkably straightforward and intuitive. For a recurring process, take a sample of points and measure the mean and the standard deviation. We'll use the mean as the "center-line", then create 3 zones above and 3 zones below it, each 1 standard deviation wide.
Based on these zones, the Western Electric Co. came up with a set of rules to determine if a process is broken:
- One point lies beyond Zone +/- 3
- 2 out of 3 consecutive points lie in Zone +/- 3 or beyond (and on the same side of the center-line)
- 4 out of 5 consecutive points lie in Zone +/- 2 or beyond (and on the same side of the center-line)
- 8 consecutive points lie on the same side of the center-line
Implementing them on your own
Despite how cool the WER are, they aren't in the qcc package. Luckily, with R they shouldn't be too tricky to implement ourselves.
Defining the Zones
The first thing we need to do is define the thresholds for each of the zones. Each zone is one standard deviation wide and there are 3 zones on each side of the center-line. Since we also want to know where the top of Zone +3 and the bottom of Zone -3 fall, we need the full set of zone boundaries. What we end up with is a grid: the numbers in columns 1 and 2 correspond to the boundaries of Zone -3, columns 2 and 3 to Zone -2, and so on.
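Here's one way the boundary grid could be sketched. The function name and the seven-boundary convention are ours: one boundary per standard deviation from -3 to +3, with the center-line in the middle, and anything past the outermost pair simply counting as beyond Zone +/- 3:

```r
# Hypothetical helper: zone boundaries from a calibrated center and spread.
# Consecutive pairs bracket Zones -3, -2, -1, +1, +2, +3.
zone_boundaries <- function(center, stdev) {
  center + (-3:3) * stdev
}

b <- zone_boundaries(center = 10, stdev = 0.33)
b  # columns 1 and 2 bracket Zone -3, columns 2 and 3 bracket Zone -2, ...
```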
Now that we know the range of each zone, we need to determine which zone every point falls in. First we compare our points to each boundary using x > zones, which gives us a giant matrix of TRUE/FALSE values. We can then calculate each point's zone by summing the rows (TRUE/FALSE evaluate to 1/0 when summed) using rowSums, which does row-wise summation on a data.frame or matrix. The value of each item is the zone it belongs in plus 4 (the extra 4 is because a value of 1 maps to Zone -3), so we subtract 4 from the vector and... voila, we have the zone that each point falls into.
Once we've determined which zone a given point falls in, we can evaluate the rules at each index in the series. If a given index violates a rule, we flag it with a +/- 1 (+ for a zone above the center-line, - for a zone below it).
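As an illustration, here's how one of the rules - 8 consecutive points on the same side of the center-line - could be flagged. The helper name and signature are ours:

```r
# Hypothetical single-rule checker: flag the 8th (and later) points of any
# run where 8 consecutive zones sit on the same side of the center-line.
rule_eight_in_a_row <- function(zones) {
  side <- sign(zones)                 # +1 above, -1 below the center-line
  flags <- integer(length(zones))
  for (i in seq_along(zones)) {
    if (i >= 8 && length(unique(side[(i - 7):i])) == 1) {
      flags[i] <- side[i]             # +1 / -1 records which side violated
    }
  }
  flags
}

rule_eight_in_a_row(c(1, -1, rep(2, 8)))  # only the last point is flagged: +1
```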
Putting them together
Using the functions we've defined, we can now compute the rules for each point and then assign a color to any violations.
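Here's a self-contained sketch of what such a compute_violations helper could look like - our own implementation of the four rules, not the original code:

```r
# Hypothetical helper: for each point, return 0 if no Western Electric rule
# fires, otherwise +1 / -1 for a violation above / below the center-line.
compute_violations <- function(x, center = mean(x), stdev = sd(x)) {
  z <- (x - center) / stdev
  side <- ifelse(z >= 0, 1, -1)
  n <- length(x)
  flags <- numeric(n)

  # "k of the last m points more than `limit` sigmas out, all on one side"
  runs <- function(i, m, k, limit) {
    if (i < m) return(0)
    w <- (i - m + 1):i
    for (s in c(1, -1)) {
      if (sum(side[w] == s & abs(z[w]) > limit) >= k) return(s)
    }
    0
  }

  for (i in seq_len(n)) {
    if (abs(z[i]) > 3) { flags[i] <- side[i]; next }         # rule 1
    for (rule in list(c(3, 2, 2), c(5, 4, 1), c(8, 8, 0))) { # rules 2-4
      s <- runs(i, rule[1], rule[2], rule[3])
      if (s != 0) { flags[i] <- s; break }
    }
  }
  flags
}

compute_violations(c(rep(10, 5), 12), center = 10, stdev = 0.5)
# only the final point is flagged (+1, via rule 1)
```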
Visualizing It All
With all of our data points in hand, we can now make a quality control chart. We'll plot the original points, overlay them with the zones, and then color each point according to the rule it breaks (if any).
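A minimal base-R sketch of such a chart might look like this. For brevity it only colors rule-1 violations (points beyond 3 standard deviations); a full version would color by whichever rule fires:

```r
set.seed(42)
x <- c(rnorm(90, mean = 10, sd = 0.33),
       rnorm(10, mean = 11, sd = 0.33))
center <- 10
stdev  <- 0.33

# color rule-1 violations (beyond 3 sigma) red; everything else black
colors <- ifelse(abs(x - center) / stdev > 3, "red", "black")

plot(x, col = colors, pch = 19,
     main = "Lug nut lengths with Western Electric zones", ylab = "mm")
abline(h = center, lty = 2)                            # center-line
abline(h = center + (-3:3)[-4] * stdev, col = "gray")  # zone boundaries
```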
You can get the entire script here.
Deploying to Yhat
Deploying this one is really easy. Since we've encapsulated most of the hard part in our helper functions, we just need to call compute_violations on our series of data. We can bypass the model.transform function since we're working with the raw data itself, and we don't have any external dependencies to declare.
Even though it's an old topic, statistical quality control is still highly relevant. While you might not be working at a lug nut factory, you probably have lots of jobs, processes, logs, or database metrics that you could monitor using control charts.