Blog

  • Welcome!

    Welcome to my blog! I’ve been working at what I call “industrial statistics” and exploratory modeling for almost 30 years. I’ve dealt with a large range of data types and problem situations, and I find that my colleagues value my judgment and insight.

    I’ve learned that having a breadth of technical knowledge is critical, but when such knowledge is wedded to good judgment or intuition about meeting the needs of a particular situation or data set, great things can happen. In this blog, I hope to impart some of the intuition I’ve accumulated.

    I’m writing this for data analysts and those who work with them. I don’t assume technical knowledge, but I do assume interest in strategies for meeting challenges and understanding features of data-based decision-making. This is not a “how-to”, but rather a “why-to”.

    Please comment if you feel moved. I’m interested to hear your perspective on the topics I’m discussing.

    You can be notified of new posts in one of the following ways:

    • Subscribe to an e-mail notification.
    • From a Fediverse account (such as a Mastodon account), you can follow {at jimg at replicate-stats.com}. (Write this address in the format @user@site-name. I have to spell it out this way because this editor automatically converts the @-form into a link to my author profile.)
    • Use an RSS reader and subscribe to https://www.replicate-stats.com/feed.

    Essays

  • Whither statistics and machine learning

    This essay goes out to statisticians who are wondering how Machine Learning (ML) compares to classical or parametric statistics, whether it will affect their jobs, or what role it should have in their current work.

    I recently presented some thoughts at the International Alliance for Biological Standardization’s annual statistics conference; the audience was primarily statisticians supporting pharmaceutical manufacturing. The conference was small and familiar; there was plenty of free-form discussion. In that discussion, questions about ML and its place in manufacturing (in a highly regulated context) surfaced repeatedly.

    Is “machine learning” a statistical endeavor?

    Is ML statistics? Should a statistician use ML tools? If so, when, or when not? Could ML tools work effectively on data coming from sparse designed experiments?

    As someone who treasured his copy of Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning when it first came out, I have no question in my mind that ML is statistical. It says “statistical” right there in the title! And the authors are statisticians, academics employed in statistics departments. Further, the discussion is founded on probability and leans heavily on statistical concepts and model-fitting principles (e.g., the classic bias vs. variance trade-off and optimizing penalized likelihood).

    I also very much appreciated Brian Ripley’s book Pattern Recognition and Neural Networks, which takes the same view. The publisher’s description says as much:

    Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks. He brings unifying principles to the fore, and reviews the state of the subject. Ripley also includes many examples to illustrate real problems in pattern recognition and how to overcome them.

    Therefore I’ve never had any doubts that ML can be viewed and handled effectively through a statistical lens. Indeed, I’m convinced that a statistical lens is very much the most effective lens to use.

    That last statement may sound tribal—“Statisticians are the best!”—but that isn’t my intent. Rather, it’s a tautology, at least in the framework I’m using.

    I consider statistics to be the study of making decisions based on noisy data. The theoretical grounding for statistics is in “decision theory,” which is game theory in which your opponent is Nature. Nature’s move in the “game” is to determine a state that is unknown to you, while the statistician’s move is to make a decision whose loss depends on Nature’s unknown state. But first, you get data. How do you use that data to make a decision? The mapping from the data to your decision is a “decision rule,” otherwise known as a “statistic”. The questions of theoretical statistics concern optimal decision rules, and why they are optimal.
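
    To put the same thing in symbols (standard decision-theoretic notation, nothing specific to any one method): Nature chooses a state \theta, we observe data X \sim P_\theta, and a decision rule (a statistic) is a map \delta from data to actions. Each rule is judged by its risk, the expected loss under Nature’s chosen state:

        R(\theta, \delta) = \mathbb{E}_{X \sim P_\theta}\bigl[\, L(\theta, \delta(X)) \,\bigr].

    The questions of theoretical statistics are then questions about which rules \delta keep R(\theta, \delta) small (Bayes rules, minimax rules, and so on), and why.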

    Put more bluntly, statistics is the study of what makes decision-making based on noisy data work well. Therefore if some technique or idea works better than “standard” statistical methods, it’s incumbent on statisticians to verify that the new method is superior, identify why it works, and update the current state of the art accordingly, much as physicists had to expand beyond Newton’s laws when Einstein’s theories were experimentally verified. Then the new ideas, whether they originated from a statistician or not, become part of the world of statistics. That’s the case in an ideal world, at least, which may take some time to manifest. (As it does in science as well!)

    So, when I assert that a statistical lens is the most effective lens through which to view ML, I’m really asserting that we need to bring all we know about how model-fitting can be made to work well to new (to us) methods in ML. And we do indeed have a few tricks to offer.

    A hierarchy of modeling needs

    We have to judge any model relative to what we want to accomplish with it. Here I’ll borrow shamelessly from Maslow’s Hierarchy of Needs, with apologies.

    • Faithfulness to data: All useful models need to predict well in the sense that we can feed in specific cases and get predictions that match actual responses to within expected error. We expect this even of models that we don’t intend to use primarily for prediction. If a model fails to be faithful to the data, all else is moot; in fact, interpreting such a model could be dangerous. (A minimal check of this need is sketched just after this list.)
    • Variable importance: We may want to know what predictor variables are important in contributing to the response.
    • Quantify relationship: We may want to quantify how much an important variable contributes to the response (or to the mean of the response distribution). This is usually expressed as saying that a unit change in the predictor induces some change in the (mean of the) response.
      • With linear models the fitted slope provides this, but the concept generalizes to smoothly non-linear contributions. In that case the change in response depends on the starting value of the predictor: for men 5 feet tall, an inch increase induces such-and-such a change in mean response, but for men 6 feet tall, the induced change is much less, for instance.
    • Statistical evidence: We may want statistical evidence that this effect is not nil, sufficient to convince a skeptical audience, such as regulators, article reviewers, and perhaps business leadership. Or we may want confidence intervals or equivalents which indicate how precisely we think we know the quantities.
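
    As an aside on the first of these needs: a quick way to check faithfulness is to compare cross-validated predictions against the observed responses, so that every case is predicted by a model that never saw it. Here is a minimal sketch in Python using scikit-learn; the data, the model choice, and the summaries printed are placeholders for illustration, not a prescription.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import cross_val_predict

        # Placeholder data: X holds the predictors, y the observed response.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))
        y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

        model = RandomForestRegressor(n_estimators=300, random_state=0)

        # Out-of-fold predictions: each case is predicted by a model fit without it.
        y_hat = cross_val_predict(model, X, y, cv=5)

        # Faithfulness check: residuals should look like noise of the expected size.
        residuals = y - y_hat
        print("Cross-validated RMSE:", np.sqrt(np.mean(residuals**2)))
        print("Correlation of observed vs. predicted:", np.corrcoef(y, y_hat)[0, 1])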

    Certain ML methods address some of the needs above. All should address the faithfulness need. Many will address variable importance (and in any case there are ways to assess variable importance even from black-box models). GAM’s and MARS allow for non-linear relationships while still letting slopes be assessed visually (or computationally). GAM’s can provide approximate statistical evidence, although any statistical evidence is questionable when it comes at the end of an exploration involving many variables.

    There are contexts in which ML satisfies an important need; one simply needs to use the right tool for the right context.

    How does machine learning fit within the practicing statistician’s toolbox?

    If I view ML as statistical, how do I blend ML with my other statistical tools? When do I reach for Random Forest, or MARS?

    Here are the contexts in which I would consider turning to ML:

    • When there is a strong possibility of a complex mapping between predictor variables and the response. By “complex” I mean interactions among predictors (provided I have enough data that identifying a complex interaction seems plausible), and likely non-linear contributions from predictors.
      • Or, I have too little domain knowledge to be confident that interactions are absent and that contributions are linear. ML provides that assurance.
    • Having any hope of modeling complex relationships requires substantial data. Therefore I would not entertain an ML model with a small number of cases.
      • There is a bit of nuance here. MARS can devolve gracefully to linear regression modeling when there aren’t enough cases to support further expansion.
      • Neural nets and SVM’s can fit small data sets, but in order to determine how well they fit, and to make intelligent choices about key questions such as the number of hidden nodes and the roughness penalty (weight decay), we need more data.
      • Tree-based methods simply need sufficient data to fit anything reasonable.
    • When there are many predictor variables, especially when there is a possibility that some of those may contribute to the response in a non-linear yet smooth way. Tree-based ML methods can do a good job of identifying important predictor variables.
      • Note that if you have many predictor variables about which you have little prior information or expert knowledge, statistical inference about specific variables will be highly fraught regardless of the method you adopt.
      • Wait, what about Bayesian methods such as Bayesian Model Averaging and inference with so-called “horseshoe” priors? Yes, these provide a reasonable inference, but it isn’t straightforward to include non-linear components. This is an area meriting further research and development. Assuming that every predictor contributes linearly is simply not a viable plan. My opinion might be in a minority among statisticians but I’m standing by it.

    Looking over this list, I see a theme. ML models are flexible and make fewer assumptions about relationships. I lean towards ML when one is exploring from a generally low state of knowledge: there are many candidate predictor variables, and one doesn’t know exactly what they mean, which are important, whether their contributions are linear, or over what range. There may be complex interactions. In this context, ML can answer some basic questions quickly: Can the predictors predict? How well? Which are important? Over time, and with additional experience, one may learn which variables are important, whether they contribute linearly or with some other pattern, and which interactions are certainly not present. As knowledge of this sort grows, a parametric (or semi-parametric) model becomes more appealing.
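
    In that exploratory, low-knowledge setting, a first pass might look like the following Python sketch (scikit-learn; the data and variable names are invented for illustration): fit a flexible model, ask whether it predicts at all on held-out data, and rank the predictors by permutation importance.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.inspection import permutation_importance
        from sklearn.model_selection import train_test_split

        # Invented "found" data: many candidate predictors, unknown relationships.
        rng = np.random.default_rng(1)
        X = rng.normal(size=(500, 20))
        y = X[:, 2] ** 2 + X[:, 5] * X[:, 7] + rng.normal(scale=0.5, size=500)

        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
        forest = RandomForestRegressor(n_estimators=500, random_state=1).fit(X_train, y_train)

        # Can the predictors predict, and how well?  (Held-out R-squared.)
        print("Test R^2:", forest.score(X_test, y_test))

        # Which predictors are important?  Permutation importance on held-out data.
        imp = permutation_importance(forest, X_test, y_test, n_repeats=20, random_state=1)
        for j in np.argsort(imp.importances_mean)[::-1][:5]:
            print(f"x{j}: mean importance {imp.importances_mean[j]:.3f}")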

    This presupposes that you’re dealing with “found” (i.e., observational) data. If you can design experiments prospectively, we have ways of addressing these early questions much more effectively.

    Can ML/AI deal with designed experiments?

    One question I’ve been pondering (successfully, I think), and which surfaced at the conference, was: can ML deal with designed experiments?

    We are told that AI, and by extension ML, have enormous potential to optimize systems. Yet some of us have been optimizing systems successfully for decades. Are we missing something? Has AI come up with a wholly new way to go about this?

    I contend that traditional statistical design and optimization will continue to have value. Moreover, in many cases there is no benefit to fitting an ML model rather than the sort of models we have been fitting. I’ll outline the features of the cases in which there would be a benefit to using ML.

    It helps to know a little about how optimization works. If you ask a computer to carry out a search for an optimum, what does the computer do? Of course there are many optimization algorithms, and there may not be one that works best for all contexts, but many of them share a simple foundation: approximating the surface locally by a simple polynomial (linear or quadratic):

    1. Pick a location (vector of parameter values).
    2. Identify the gradient (first derivative). This can be done analytically, if an analytic first derivative is known, or it can be estimated numerically by evaluating the function of interest at locations very close to the starting point.
      1. In some cases a second derivative is found.
    3. Either implicitly or explicitly, use the derivatives to approximate the surface locally by a first- or second-order Taylor approximation.
      1. In the linear case, this indicates a direction of improvement. Move in that direction by some regulated amount.
      2. In the quadratic case, the approximation may support an optimum point. Move towards that point, perhaps by a regulated amount.
    4. Repeat until convergence; hope for convergence rather than divergence!

    Note: It isn’t necessary to estimate the surface globally; it’s sufficient to have a series of good-enough local approximations. And Taylor’s Theorem tells us that the surface, if smooth, can be approximated locally by linear or quadratic models.
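
    As a toy illustration of that recipe, here is a plain gradient-descent loop in Python, with the gradient estimated numerically by evaluating the function at nearby points. The objective function and step size are arbitrary stand-ins, not anything from a real problem.

        import numpy as np

        def f(x):
            # Arbitrary smooth surface to minimize (a stand-in for the real objective).
            return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2 + 0.3 * x[0] * x[1]

        def numeric_gradient(f, x, h=1e-6):
            # Step 2: estimate the first derivative from evaluations very close to x.
            g = np.zeros_like(x)
            for i in range(len(x)):
                e = np.zeros_like(x)
                e[i] = h
                g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
            return g

        x = np.array([5.0, 5.0])           # Step 1: pick a starting location.
        step = 0.1                         # The "regulated amount" to move.
        for _ in range(200):
            g = numeric_gradient(f, x)     # Steps 2-3: local linear approximation.
            if np.linalg.norm(g) < 1e-8:   # Step 4: stop when (hopefully) converged.
                break
            x = x - step * g               # Move in the direction of improvement.

        print("Approximate optimum:", x)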

    Statistical experimental design (i.e., “Design of Experiments”) also works by finding a series of linear approximations that move one towards an optimum. Then, if one uses response-surface methods, we use a quadratic polynomial to make a local approximation that tells us where the optimum is estimated to be, and also models the neighborhood around that optimum (which supports tolerance regions).
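
    For the record, the fitted quadratic has a closed-form stationary point. Writing the fitted surface in the usual response-surface notation,

        \hat{y}(x) = b_0 + x^\top b + x^\top B x,

    where b holds the linear coefficients and B is the symmetric matrix of quadratic and two-factor-interaction coefficients, setting the gradient to zero gives the estimated stationary point

        x_s = -\tfrac{1}{2} B^{-1} b,

    and the eigenvalues of B tell us whether x_s is an estimated maximum, minimum, or saddle point.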

    As with numeric optimization algorithms, at no point is a global estimate of the surface required.

    Furthermore, it is often the case that the process under study is very expensive to evaluate; each run may cost many thousands of dollars. Therefore we pick runs extremely judiciously, and we create very sparse designs. It’s not unusual for the design to be only just big enough to fit a specific linear or quadratic model; these are “saturated” designs, with as many design points as polynomial parameters.

    Putting this all together, my recommendation is as follows:

    • If you already have a large data set of experimental conditions and their associated outcomes, use an ML model to allow prediction of future cases at any location in the design space. Then you can attach a search algorithm to find optimum conditions, or carry out designed experiments in silico. Since we must model the whole space, we can’t rely on a local approximation; we must use a model that is capable of being complex.
    • If you don’t have a large data set in hand, but will have to carry out experimental runs in order to explore the design space, statistical experimental design is the way forward.
      • In analyzing data from these designs, use traditional methods, which are based on first- and second-order polynomials.
      • If you have a problem with special features, you still have to put on a statistician’s hat to decide the best design points to run.

    Given data from a sparse design, an ML model will not improve on the polynomial. In fact, often there won’t be enough design points to allow differentiation from a polynomial model. For example, the optimal design for fitting a quadratic model in one factor has three points: low extreme, high extreme, and middle. In this (trivial) example, there are no data to support departure from a quadratic.
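
    To spell that example out: with coded levels x = -1, 0, +1, the model matrix for the quadratic y = \beta_0 + \beta_1 x + \beta_2 x^2 is

        X = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix},

    a full-rank 3-by-3 matrix: three parameters, three runs, zero residual degrees of freedom. The quadratic passes through all three points exactly, so the data carry no information about departures from it, and no ML model can extract information that isn’t there.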

    Sparse response-surface designs are similar, and so are sparse designs intended for early exploration (e.g., Plackett-Burman and fractional factorial designs). There simply won’t be a benefit to using ML.

    What does it mean if an ML model fits better?

    Let’s return to the question of a poorly-fitting parametric model and a well-fitting ML model. An ML model generalizes parametric models in three ways:

    1. It allows for non-linear relationships between predictor variables and response.
    2. It allows for interactions. Some (neural net, SVM, Random Forest) allow for arbitrary interaction. A few (gradient boosting, MARS) allow for a maximum degree of interaction to be specified.
    3. Some ML methods can automatically combine “diffuse” effects and then operate on them non-linearly. A classic example is gene expression; researchers often collect a large number of gene expression values, and we don’t expect one or a handful of genes to be powerful predictors. Rather, we expect multiple genes to express similar information, in a context of overall high variability. Regression models can combine multiple genes in a linear combination; a neural net does this and then can apply a nonlinear transformation to the linear combination. (See also projection pursuit regression and sliced inverse regression.)
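
    A bare-bones numpy caricature of that third point, with invented numbers and weights (nothing here is fitted to real data): a single hidden unit is exactly a linear combination of the inputs pushed through a non-linear function.

        import numpy as np

        rng = np.random.default_rng(2)

        # Invented expression values for 50 genes on 10 samples; many genes carry
        # overlapping, "diffuse" information about the same underlying signal.
        genes = rng.normal(size=(10, 50))

        # What a regression model can do: form a linear combination (an index).
        w = rng.normal(size=50)   # illustrative weights; a real fit would estimate them
        index = genes @ w

        # What a hidden unit adds: a non-linear transformation of that same index.
        hidden = np.tanh(index)

        # A network's output layer is then another (linear) combination of such units.
        print(index[:3], hidden[:3])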

    As I said above, there’s no magic here. If a parametric model is inferior to an ML model, it’s because the ML model is tracking the data in some attribute for which the parametric model is too restricted. And the possible generalizations are the three above, so the data set must exhibit one or more of those attributes.

    However, each of these three attributes can be addressed with a wholly statistical tool set:

    • GAM’s allow continuous predictors to make additive contributions via smooth non-linear (or linear) functions. Regression models with spline components—so-called “regression splines”—can also work.
      • Regression splines control curve roughness by limiting the size of the basis rather than with a fitting penalty.
    • An interaction term can be added to any additive model. We simply need to know what predictors the interaction involves.
      • There can be a challenge in finding interaction terms.
    • A multitude of diffuse predictors can be combined into an index via principal components analysis prior to modeling, among other methods.

    An ML model does all this, quickly and in one step (from the analyst’s point of view).
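
    For comparison, here is a sketch of that wholly statistical toolset strung together in Python with scikit-learn. Every column assignment, tuning value, and the response itself are invented for illustration; a real analysis would choose them with care.

        import numpy as np
        from sklearn.compose import ColumnTransformer
        from sklearn.decomposition import PCA
        from sklearn.linear_model import RidgeCV
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

        # Invented layout: columns 0-2 are ordinary continuous predictors,
        # columns 3-22 are a block of "diffuse" predictors (e.g., expression values).
        smooth_cols = [0, 1, 2]
        diffuse_cols = list(range(3, 23))

        features = ColumnTransformer([
            # Regression-spline basis: smooth non-linear contributions, size-limited.
            ("splines", SplineTransformer(n_knots=5, degree=3), smooth_cols),
            # Combine the many diffuse predictors into a few indices before modeling.
            ("pca", PCA(n_components=3), diffuse_cols),
        ])

        model = Pipeline([
            ("features", features),
            # Pairwise products supply candidate interaction terms among the features;
            # this is crude, and the ridge shrinkage below keeps it under control.
            ("interactions", PolynomialFeatures(degree=2, interaction_only=True,
                                                include_bias=False)),
            ("fit", RidgeCV()),
        ])

        # Invented data; the pipeline then behaves like any other estimator.
        rng = np.random.default_rng(3)
        X = rng.normal(size=(300, 23))
        y = (np.sin(X[:, 0]) + X[:, 1] * X[:, 2]
             + X[:, 3:].mean(axis=1) + rng.normal(scale=0.3, size=300))
        print("In-sample R^2:", model.fit(X, y).score(X, y))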

    Interpretable ML

    In researching this essay I came across a line of ML research called “interpretable ML”. I feel embarrassed that I hadn’t heard of this line of research until now! I came across Dr. Cynthia Rudin’s article “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.” It’s freely available and I highly recommend a read. Another article, of which Dr. Rudin is the lead author, and which I highly recommend for anyone thinking about ML, is titled “Amazing Things Come From Having Many Good Models”.

    In this research area, the word “interpretable” in “interpretable ML” means exactly what it says. While this term doesn’t have a precise definition, examples of models presented as interpretable include:

    • Lists of univariate rules combined with “and” and “or” logical statements.
    • Generalized additive models (i.e., predictor variables contribute additively via smooth functions).
    • Tree-structured classifiers.

    Dr. Rudin makes several very interesting points within the two papers above:

    • In almost all cases of predictive modeling, an interpretable model can be found that predicts as well as a complex “black-box” model.
      • Finding the simple model that predicts well may require sophisticated optimization algorithms, techniques that are not in common use today.
      • This doesn’t apply to image processing and text-oriented large language models, but there is potential for special interpretable approaches in these areas.
    • In almost any realistic classification or scoring problem, there are many models that fit approximately equally well. Some are complex but some are interpretable.
      • When noise in the response is high, the possibility of many equivalently-predicting models grows.

    …and Generalized Additive Models for the win!

    At the IABS statistics conference, conversations indicated that there is a widespread presumption that ML models generally predict better than the models statisticians typically fit, which is to say, parametric models. (And, the presumption continues, we should use parametric models anyway, because they’re interpretable.) I’ve long argued that this presumption is not generally true, though it may frequently be true. Dr. Rudin takes it a step farther and argues convincingly that the presumption is almost never true: in almost every practical case, there exists an interpretable model that predicts as well as the best ML model.

    However, the high-performing interpretable model(s) might not be a simple parametric model. Among the classes of interpretable ML models that frequently predict well is the Generalized Additive Model, possibly with some interaction terms; that is, the standard parametric model extended in one or more of the three ways discussed in the section above, What does it mean if an ML model fits better?

    Why is the presumption of ML predictive accuracy so pervasive? I have my suspicions:

    • The most commonly-used statistical modeling methods (linear regression or generalized linear regression models) often don’t predict very well, because relationships occurring in Nature are often smoothly non-linear. Meanwhile, ML models often do predict well.
      • But why do we suffer models that don’t predict well? I believe we’ve been distracted by hypothesis testing. Typically, even if a relationship is smoothly non-linear, a hypothesis test that assumes a linear approximation will still give very strong evidence of a relationship, not least because it gives only one degree of freedom to the effect in question. Thus there is a statistical rule of thumb that linear is “good enough” for practical purposes. But we forget that that “purpose” is testing, and this isn’t always our purpose!
    • AI and ML often do predict well. Nothing in this essay claims that AI/ML doesn’t predict well, and I believe Dr. Rudin’s research doesn’t claim this either. The ability of AI/ML is not in question; the ability of the alternatives is.
    • There is an enormous amount of money behind AI. There is of course a lot of hype, and no small amount of that hype is driven by money, by hope, by news attention, and by the desire for that attention.
    • “Neural nets can do things multiple regression cannot, therefore neural nets are more capable, therefore they must be able to predict better in classical regression contexts.” This does not logically follow, yet it is a seductive and enduring fallacy. Just because New Shiny Tool B does something that Old Tool A cannot do does not imply that Shiny B can do Old A’s traditional job better. Maybe A and B excel at different tasks.
      • This is the motivating principle behind a large number of published articles comparing the predictive accuracy of neural nets to logistic regression on clinical trial data. This literature has failed to find a consistent general advantage of neural nets, yet researchers continue to try.

    Where this is all winding up is something that deserves a name. I’m going to call it the GAM principle:

    Additive models
    - with smoothly non-linear contributions from continuous predictors...
    - and appropriate interaction terms (if we can find them)...
    - and, for many diffuse predictors, indices combining them into a "latent" predictor...
    ...can predict as well as ML models, very often.

    I use the term “GAM principle” rather than simply “GAM” because a GAM is a bit more specific: it refers to an additive model in which some or all of the continuous factors contribute via smooth functions. These are modeled using smoothing methods such as smoothing splines or LOESS. (Nowadays splines seem to predominate.) “Smoothing” with splines involves creating a very large set of basis elements and limiting complexity via a penalty. Alternatively, one could augment a linear model with a more limited set of spline basis elements and control complexity through the number of basis elements; we refer to this as a “regression spline” model. A regression spline satisfies the GAM principle but is not strictly a GAM. MARS similarly generates a limited set of spline basis elements, and satisfies the GAM principle.
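
    In symbols, the distinction is the standard one: a smoothing spline chooses f to minimize the penalized criterion

        \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \lambda \int f''(t)^2 \, dt,

    where the basis is effectively very large and the penalty \lambda controls roughness, while a regression spline restricts f to a small basis,

        f(x) = \sum_{j=1}^{k} \beta_j B_j(x),

    and controls complexity through the number of basis functions k, fitted by ordinary (penalty-free) regression.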

    The GAM principle implies that statisticians can fit interpretable models that are well within the statistics tradition, and predict as well as other ML models, but such statisticians will not be able to do this in a business-as-usual manner. They’ll need to familiarize themselves with GAM’s, regression splines, and/or MARS, if they haven’t already. In my experience, most statisticians have not.

    I can’t help feeling smug. My very first blog post in this space made essentially the same point! In that post I tell a story about statisticians delightedly passing around a research paper documenting that logistic regression fitted to clinical-trial data predicted as well as neural nets. Hooray for our side! Yet when I read the article, it referred to splines in the model, putting it in the GAM class. However, I’d never seen any of the colleagues who were so happily taking credit use splines or GAM’s. These statisticians were claiming a victory that wasn’t really theirs.

    My familiarity with splines is partly motivated by Frank Harrell, Jr.’s book Regression Modeling Strategies, in which he advocates for using splines in models whenever possible. (“Possible” meaning “there are enough degrees of freedom available.”) His advocacy implicitly confirms that he has seen many data sets in which relationships were smoothly non-linear. Indeed, most of the book’s examples showed as much.

    In my own experience, I routinely transform continuous predictor variables towards normality, and I find a transformation is warranted in more than half the cases I’ve seen. In modeling, I’ve only rarely found statistical evidence for a non-linear contribution from a transformed predictor. However, because the predictors were often transformed non-linearly, a contribution that is “linear on the transformed scale” is non-linear on the original scale. My experience therefore aligns with Dr. Harrell’s, in a way.

    Now I’m seeing Dr. Rudin’s work, and Dr. Harrell’s experience, and my own experience all aligning: models that satisfy the GAM principle are powerful, statistical, and interpretable ML. I’ll be more likely to try a GAM-type model in the future!

    A technicality about interpretable ML

    By advocating for GAM’s and/or regression splines, I’m laying a claim to interpretable ML that published practitioners in the field might contest. This is because statisticians fit models differently from how they are fitted in the published interpretable-ML literature.

    It depends on the type of model, but interpretable-ML practitioners often use exhaustive search to find parameters that allow the models to fit observed data. For instance, Rudin describes CORELS, an algorithm that generates logical rules; it searches the discrete values of every variable to evaluate candidate rules, optimizing a criterion that combines classification agreement with known truth and a penalty for each rule added.

    We statisticians are accustomed to maximizing a likelihood function, or using likelihood in a Bayesian calculation, or simply using statistical measures to guide choices. We don’t optimize classification performance directly. And I think there are reasons not to do so.

    I’ve been in too many meetings discussing cutoffs for assays in which someone argues for moving a cutoff a bit to catch one more point, increasing the assay’s putative sensitivity. But we don’t care about that point: we care about the next point, which we haven’t seen yet.

    Based on experience with cutoffs, I’ve argued for fitting densities to assay signals from populations of negatives and positives and then optimizing those smoothed distributions, rather than hashing over individual observed points. A smoothed ROC curve, if you will. Such smoothing generally improves performance and generalizability where it is applied. I expect that a similar logic would improve interpretable ML optimization methods. (The challenge is in the smoothing, though.)
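
    Here is a minimal sketch of what I mean, assuming, purely for illustration, that normal densities describe the negative and positive signal populations adequately; the data are invented.

        import numpy as np
        from scipy import stats

        # Invented assay signals from known negatives and known positives.
        rng = np.random.default_rng(4)
        neg = rng.normal(loc=1.0, scale=0.4, size=80)
        pos = rng.normal(loc=2.2, scale=0.6, size=60)

        # Fit a density to each population (normal here; any adequate smooth fit would do).
        neg_mu, neg_sd = neg.mean(), neg.std(ddof=1)
        pos_mu, pos_sd = pos.mean(), pos.std(ddof=1)

        # Smoothed ROC: sweep the cutoff and read sensitivity and specificity off the
        # fitted distributions rather than off the individual observed points.
        cutoffs = np.linspace(neg.min(), pos.max(), 500)
        sensitivity = stats.norm.sf(cutoffs, loc=pos_mu, scale=pos_sd)   # P(signal > c | positive)
        specificity = stats.norm.cdf(cutoffs, loc=neg_mu, scale=neg_sd)  # P(signal <= c | negative)

        # One way to optimize over the smoothed distributions: maximize Youden's J.
        j = sensitivity + specificity - 1.0
        print("Cutoff maximizing smoothed Youden's J:", cutoffs[np.argmax(j)])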

    Still, strictly speaking, I can’t claim to be using interpretable ML of the sort that’s been published as such, if I’m not optimizing models as the publications indicate. Fair enough, but I propose that statistical model-fitting will perform as well. I contend that it’s the type of model used that matters, more so than the method of fitting.

    But I’m picking nits. Consider what Dr. George Box said, after saying that all models are wrong but some are useful:

    Since all models are wrong, the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.

    And my complaint here is no tiger.



Subscribe for e-mail notification

If you like, enter your e-mail address and you’ll receive an e-mail whenever I publish a new essay.

  • I won’t give your e-mail address to anyone.
  • You won’t receive a notification more than once per week. I can’t write faster than that! I don’t publish on a fixed schedule; you should see a notification every two or three weeks.