Welcome to my blog! I’ve been working at what I call “industrial statistics” and exploratory modeling for almost 30 years. I’ve dealt with a large range of data types and problem situations, and I find that my colleagues value my judgment and insight.
I’ve learned that a breadth of technical knowledge is valuable, but that when such knowledge is wedded to good judgment and intuition about the needs of a particular situation or data set, great things can happen. In this blog, I hope to impart some of the intuition I’ve accumulated.
I’m writing this for data analysts and those who work with them. I don’t assume technical knowledge, but I do assume interest in strategies for meeting challenges and understanding features of data-based decision-making. It’s not a “how-to”, but rather a “why-to”.
Please comment if you feel moved. I’m interested to hear your perspective on the topics I’m discussing.
You can be notified of new posts in one of the following ways:
Subscribe to an e-mail notification.
From a Fediverse account (such as a Mastodon account), you can follow {at jimg at replicate-stats.com}. (Put this address in the format @user@site-name. I have to write it this way because this editor automatically converts that format into a link to my author profile.)
I’d like to offer thanks to Ambuj Tewari for an excellent presentation, “Key Ingredients of the LLM and Transformer Revolution”, and to the authors of the book “Dive into Deep Learning” for writing their very useful book and making it available online.
Introduction
An airplane’s tail assembly adds valuable stability, but also considerable drag. Researchers had an idea: could they build an aircraft without a tail assembly and control it instead with a multitude of control surfaces on its wings? It’s a challenging control problem: how would those control surfaces be managed, second by second?
They built an experimental model, and to manage those controls they embedded a static physics-derived model that linked control-surface states to expected aircraft behavior. To get the aircraft to do what was asked, a search system would consult the model to find the requisite control-surface configuration, and that configuration would be sent out to the control surfaces. Bank to the right and turn? What control surface states would induce that change? Make it so.
But the static physics model wasn’t accurate enough. So they added an AI model to estimate the discrepancies between the physics-predicted behavior and actual behavior. Furthermore, these discrepancies could evolve as the aircraft was put through different actions: as the craft climbed, turned, dived, etc., the discrepancies from the physics model could change. To train the AI continuously, the aircraft perturbed the control surfaces continually but slightly, providing a continuous stream of training data.
This probably sounds of a piece with the many AI stories we hear about currently. But guess what: this story dates to sometime around the year 2000. We didn’t have tens of thousands of NVIDIA GPUs cranking through terabytes of data then, fitting neural networks with millions of parameters. Rather, in this case, neural network updating was happening in a computer on board an experimental aircraft circa 2000–quite a limited computing environment!
In fact, AI (or what was then called machine learning) has been hard at work, and progressing, for decades. Finally generative AI with large language models started doing creepily intelligent things which blew open the doors to public discourse. Now the street–and the market–is buzzing with the word “AI”, blithely and indiscriminately.
But the phenomena to which we attach the word “AI” are not in fact a homogeneous mass. There are important distinctions to be made, and if we don’t make them, our public and business decision-making will suffer.
Case in point: is the emergence of AI going to add so much electricity demand as to overwhelm efforts to decarbonize our economy? In this article, analyst Michael Barnard argues that the efficiencies AI will generate will more than offset the electricity it consumes. (Shout out to CleanTechnica, to which Mr. Barnard often contributes, and to his highly informative podcast–I recommend them both highly.) Without engaging here with the merits of Mr. Barnard’s thesis, I’ll note that he points to two main AI-derived drivers of efficiency:
Better prediction and control of power-using systems. One, obviously, is the grid, with batteries and variable solar and wind added to the energy mix. Another example is your future home with solar panels, a home battery (which may or may not have wheels), and an HVAC system which might be integrated with a hot water heater. There will be many more energy-using systems as well.
Improved and accelerated design of devices and materials using AI.
Of course, AI may enhance efficiency in other ways as well, but these two modes predominate.
It struck me that the sort of AI that can support these two main drivers is not the sort of AI that drives chatbots, prompted images, and on-the-spot essays; in fact, the sort of AI that can drive efficiencies doesn’t suck up a lot of electricity at all.
If I’m right, then talking about efficiency gains and AI without acknowledging various species of AI is like talking about traffic safety while ignoring the fact that small cars and big-rig trucks are different: it’s unlikely that a useful conversation will emerge.
So, rather than engage with Mr. Barnard’s thesis, I’d like to offer such nuance to the public discourse. Yes, bringing nuance to public discourse is already swimming upstream, and is probably doomed. But I live in a lonely little buzzword-free zone, and it’s what I do.
Two characteristics of AI species strike me as salient:
Size: Is the AI application small or large?
Use: Is it predictive, or generative (such as a chatbot)?
Confession
The anecdote that I used to open this essay–concerning a neural net in flight control–is my recollection of a press release or breathless short article I came across many years ago, sometime circa 2000. I’ve searched for documentation or publication about this anecdote but haven’t been able to find it. It’s possible that my memory was inaccurate, or that the short article I read was inaccurate, or both. Nevertheless, my searching surfaced a literature on neural nets in flight control in various contexts, such as automatically adjusting for damage in a fighter aircraft. By the year 2000, there was a substantial body of research on using adaptive (i.e., continuously learning) neural nets in control algorithms in multiple contexts. By then there was also literature on using an ingenious and surprisingly simple optimization algorithm, Simultaneous Perturbation Stochastic Approximation (SPSA), to train neural nets in control algorithms in contexts other than aircraft (e.g., water treatment plants); I was familiar with that work when I read the press release.
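Since I’ve called SPSA “surprisingly simple”, here is a minimal sketch of the idea in Python. This is my own toy illustration, not the flight-control implementation; the gain constants follow Spall’s commonly cited defaults and would be tuned in any real application.

```python
import numpy as np

def spsa_minimize(loss, theta0, n_iter=1000, a=0.2, c=0.1, A=50, alpha=0.602, gamma=0.101, seed=0):
    """Minimal SPSA: estimate the gradient from just two (possibly noisy) loss
    evaluations per iteration, no matter how many parameters there are."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1 + A) ** alpha                       # decaying step size
        ck = c / (k + 1) ** gamma                           # decaying perturbation size
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # perturb all coordinates at once
        g_hat = (loss(theta + ck * delta) - loss(theta - ck * delta)) / (2 * ck) * (1 / delta)
        theta = theta - ak * g_hat
    return theta

# Toy use: a noisy quadratic whose minimum is at (1, -2, 3).
target = np.array([1.0, -2.0, 3.0])
noisy_loss = lambda th: np.sum((th - target) ** 2) + np.random.normal(scale=0.01)
print(spsa_minimize(noisy_loss, np.zeros(3)))   # should land near the target
```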
Generative, or merely predictive?
Generative AI is based on predictive models, so to be precise, the dichotomy is not generative vs. predictive, but generative vs. merely predictive. In essence, the “generative” component takes a set of predicted outcomes and selects the most likely one.
For instance, when we ask AI to generate an image, we give some language-based conditions and it uses a pre-fitted model of a joint distribution of image features and text descriptions. In the generative setting we usually interact via language–the chatbot–which adds an additional layer onto the system.
In a predictive-only alternative, we could take a text prompt and a given image and score the image as more or less likely given the text. We could score two images to see which conformed better to the text. “Generative” takes this to the next level and says “Generate a most-likely image corresponding to the text.”
A predictive-only system could also classify an image as being more likely that of a cat than a dog (say). Another predictive system might take data for a house going on the market–its location, number of rooms, age, etc.–and return an expected selling price. This is the stuff of machine learning going back decades, done sometimes well and sometimes not so well.
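As a concrete illustration of the scoring idea, here is a sketch using the Hugging Face transformers library and a public CLIP checkpoint. The image file names are hypothetical, and this is just one way to get a text-to-image conformity score.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open("photo_a.jpg"), Image.open("photo_b.jpg")]   # hypothetical files
inputs = processor(text=["a tabby cat sleeping on a couch"], images=images,
                   return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_image    # one similarity score per image
print(scores)    # the higher-scoring image conforms better to the text
```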
System control with merely-predictive AI
Predictive-only AI can carry out system-control tasks (a minimal sketch in code follows this list):
Fit a predictive model that relates vectors of input conditions to outputs. Conditions include both environmental factors and system control settings. E.g., the environmental condition is the temperature outside, and the control setting is the power given to the (variable) AC compressor.
Observe the current environmental conditions; these become part of the model input, to be complemented by a candidate compressor power setting.
Apply a search routine (i.e., an optimization algorithm) to try various control settings, together with the noted environmental conditions, to find the control setting that gives the best predicted outcome.
This could require many tries, but remember that evaluating the predictive model at a given input configuration is cheap and fast.
It’s guaranteed that the answer will be a realistic system setting, because the optimization algorithm only feeds in realistic settings. The search will never depart into unrealistic territory (if the algorithm is well-configured). Assuming that the optimizer optimizes, if the model has decent predictive power, then the selected answer will be good.
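Here is a minimal sketch of that loop in Python. The data-generating function, the numbers, and the choice of gradient boosting are all invented for illustration; the point is the division of labor between a predictive model and a search over control settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical historical log: outdoor temperature, compressor power setting,
# and a resulting penalty (energy cost plus comfort deviation) to be minimized.
rng = np.random.default_rng(0)
temp = rng.uniform(15, 40, size=500)
power = rng.uniform(0, 1, size=500)
penalty = (temp / 10 - 2.5 * power) ** 2 + 0.5 * power + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(random_state=0)
model.fit(np.column_stack([temp, power]), penalty)      # step 1: fit the predictive model

current_temp = 32.0                                     # step 2: observe current conditions
candidates = np.linspace(0, 1, 201)                     # step 3: search over control settings
X_cand = np.column_stack([np.full_like(candidates, current_temp), candidates])
best_power = candidates[np.argmin(model.predict(X_cand))]
print("chosen compressor power:", round(best_power, 2))
```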
Alternatively, one could use generative AI with a prompt such as “The temperature is such-and-such; tell me how much power to give the compressor”. This approach would require a vast model fitted to a vast data set, and how it arrives at its answer isn’t easy to say. There’s no guarantee that the answer won’t be “the chair”. (Old joke: “How many surrealists does it take to screw in a light bulb?” “The chair.”) There’s no guarantee that it makes any physical sense at all. Of course, careful practitioners could probably engineer some safeguards. But I ask: why? Why not have those practitioners use their skills to focus on fitting the best-predictive model possible, and make sure the optimization algorithm is appropriately bounded and actually optimizes?
In the example of a green-tech house with solar panels, batteries, hot-water heater, and ongoing energy use, one needs to predict the following, for the near term (e.g., 12-24 hours):
The price of electricity over time (if it varies)
The house’s energy use over time
Then an optimization algorithm could search for an allocation of electricity that optimizes some criterion.
What factors could influence these predictions? Not terribly many: something about likely future weather and likely future energy use. It doesn’t seem that a massively complex model would be required to give good predictions.
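Given such forecasts, the allocation step itself can be modest. Here is a sketch using a plain linear program; the forecasts, battery parameters, and constraints are all made up for illustration, and a real system would add solar generation, efficiency losses, and so on.

```python
import numpy as np
from scipy.optimize import linprog

hours = np.arange(24)
price = 0.15 + 0.10 * np.sin((hours - 6) * np.pi / 12)        # $/kWh, hypothetical forecast
load = 1.0 + 0.6 * np.sin((hours - 8) * np.pi / 12) ** 2      # kWh per hour, hypothetical forecast

capacity, soc0, p_max = 10.0, 5.0, 3.0   # battery size, starting charge, hourly power limit (kWh)

# Decision variable b[t] = battery discharge in hour t (negative means charging).
# Grid purchase is load - b, so minimizing cost = price @ (load - b) is the same
# as minimizing -price @ b.
c = -price

# State of charge after hour t is soc0 - cumsum(b)[t]; keep it between 0 and capacity.
lower_tri = np.tril(np.ones((24, 24)))
A_ub = np.vstack([lower_tri, -lower_tri])
b_ub = np.concatenate([np.full(24, soc0), np.full(24, capacity - soc0)])

bounds = [(-p_max, min(p_max, l)) for l in load]   # never discharge more than the hour's load

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("battery schedule (kWh, + = discharge):", np.round(res.x, 2))
```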
Product and material design accelerated with merely-predictive AI
Merely-predictive AI can accelerate product design in collaboration with an optimization routine in a similar manner (a sketch in code follows the list):
Develop a first-principles model, such as a finite-element analysis. Evaluating this model for a configuration could be computationally expensive.
Check the first-principles model against a few prototypes to make sure it’s fairly accurate.
Run the first-principles model at multiple multivariate conditions to support training a predictive AI model, which can return predictions cheaply and quickly.
Note that you won’t have an abundance of data, because each data “observation” requires an expensive model run.
Finding the smallest possible set of conditions that gives the most possible information to support fitting a predictive model is something that statisticians do! Many people don’t realize that there are strategies and technologies to do this. Given that these runs and prototypes are expensive, call a statistician!
Fit a predictive model to this training data set. Check that it predicts the physical model fairly accurately.
Conduct a search over the input space for the location that gives the best predicted outcome.
Build a prototype with that design to confirm that it performs well.
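A minimal sketch of that surrogate-plus-search pattern, with a stand-in function playing the role of the expensive first-principles model (everything here is invented for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expensive_simulation(x):
    """Stand-in for a slow first-principles model (e.g., a finite-element run)."""
    return (x[0] - 0.3) ** 2 + (x[1] + 0.5) ** 2 + 0.1 * np.sin(8 * x[0])

# A small set of input conditions; a statistician would choose these points
# deliberately (a designed experiment) rather than at random as done here.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(30, 2))
y_train = np.array([expensive_simulation(x) for x in X_train])

surrogate = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
surrogate.fit(X_train, y_train)                  # the cheap-to-evaluate predictive model

# Search the input space on the surrogate, then confirm the winner on the real model.
candidates = rng.uniform(-1, 1, size=(20000, 2))
best = candidates[np.argmin(surrogate.predict(candidates))]
print("proposed design:", best, "confirmed outcome:", expensive_simulation(best))
```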
Again, one could turn to generative AI and simply ask for a design. As before, this may or may not give an answer that is even sensible. And again, I am strongly of the opinion that if a large AI model can make better predictions of outcomes, we should use that model in conjunction with a search algorithm rather than putting that energy (both human and electric) into a generative system.
In other words, the important problems of system control and product design can be addressed by predictive AI, and in many cases small predictive AI. Generative AI offers nothing new here.
“Small” or “Large”?
One aspect of the recent achievements of AI is an increase in size: massive models (i.e., millions of parameters) fitted to massive data sets, using massive numbers of computing nodes. Twenty years ago we didn’t do this, in part because we couldn’t do this.
But size alone didn’t do it; shape matters as well. The smaller neural nets of years ago often had a single hidden layer, or a few hidden layers. The newer, larger nets have many (e.g., 15) hidden layers.
There have also been advances in pre-processing data to make it more amenable to modeling, in topic-specific ways. For instance, the “word2vec” process replaces individual words with numeric vectors, positioned so that words used in similar contexts get similar vectors, which helps convey important context. This innovation is not, strictly speaking, part of the neural net, but it has turned out to be an effective way to pre-process data to fit the net to an easier problem.
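For the curious, here is a tiny sketch of word2vec using the gensim library; the corpus is far too small to produce meaningful vectors, but it shows the mechanics.

```python
from gensim.models import Word2Vec

# Toy corpus; a real application would use a large, domain-relevant corpus.
sentences = [["the", "compressor", "draws", "power"],
             ["the", "pump", "draws", "power"],
             ["solar", "panels", "supply", "power"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)
vec = model.wv["compressor"]                           # a dense vector for the word
print(model.wv.most_similar("compressor", topn=2))     # nearby words in the embedding space
```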
There have also been innovations in model architecture, although models are still identifiable as neural networks.
Feature discovery
It’s intuitively appealing to think that adding hidden layers to a neural network increases its modeling capacity–that is, it increases the complexity of the phenomenon the model can track effectively. And in fact this is true. What is less intuitive is that, in theory, adding nodes to a single hidden layer in a network also makes the network more capable. Both kinds of network are “universal approximators”, capable of approximating any continuous surface with arbitrary accuracy, if given enough layers and/or nodes.
A consequence of this is that for any neural net with many hidden layers, there exists a neural net with only one hidden layer that approximates the multi-layer net as well as you like. If you want more accuracy, add more nodes to the single hidden layer; such a network exists.
However, existence says nothing about finding the theoretically-existing network, i.e., determining how many nodes to use and what all the model’s coefficients are. Finding the network that theory says exists may be like finding a needle in a haystack.
What adding layers actually does is make the needle easier to find; it shrinks the haystack. It constrains the model, in a way that makes physical sense and is beneficial for the fitting process.
Using many layers should improve prediction when the problem has a natural hierarchy. This is the case for images: an image is comprised of thousands (millions?) of pixels. A pixel is a single input datum, but model behavior shouldn’t hinge on the value of a single pixel. Pixels in a neighborhood form features, such as edges, or textures. Edges and textures may be organized together into shapes, or artifacts. Artifacts in turn may organize into objects. A hierarchical model structure (multiple hidden layers feeding upwards in hierarchy) encourages modeling a hierarchical physical structure.
It is often said that deep neural nets (those with many layers) discover their own features.
Aside: a concept associated with features is invariance. If you have a shape that’s shifted 4 pixels to the left, the model should identify it just as well (assuming it’s not shifted off the edge of the image). A model should be “invariant” to horizontal shifts. Or any shifts. This leads to convolutional neural nets, in which nodes collect data from neighborhoods, and network coefficients for these nodes are required to be the same. This also reduces the number of coefficients to be fitted.
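A small sketch of that weight sharing, using PyTorch: because the same 3×3 kernel is applied at every location, shifting the input shifts the output in lockstep (away from the image edges).

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)   # one shared 3x3 kernel

img = torch.zeros(1, 1, 16, 16)
img[0, 0, 5:8, 5:8] = 1.0                      # a small square "shape"
shifted = torch.roll(img, shifts=4, dims=3)    # the same shape moved 4 pixels to the right

out, out_shifted = conv(img), conv(shifted)
# The response to the shifted image is the shifted response to the original.
print(torch.allclose(torch.roll(out, shifts=4, dims=3), out_shifted))
```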
Similarly, if we know in advance that overall image brightness should not be informative (because images are collected under a range of brightness conditions), we could pre-process images to adjust brightness to a constant level. There’s an infamous story in Brian Ripley’s book Pattern Recognition and Neural Networks (a fantastic book for learning important concepts, even if it has grown out of date): a team of neural network practitioners set out to determine if an automated system could discriminate allied from enemy tanks based on aerial photographs. They photographed tanks from the air and fitted a neural net, and found that a surprisingly sparse neural net could discriminate tanks surprisingly well. When they collected additional data, however, they found that the neural net classified no better than a coin toss.
What happened? They had set out the allied tanks on one day, and the enemy tanks the following day. One of those days was sunny, and the other cloudy. The neural net was classifying tanks based on overall pixel intensity.
The most important lesson from this anecdote is that experimental design still matters. (Call a statistician!) A secondary lesson is that if you know in advance that photographs will be taken on bright as well as dark days, and your model should be utterly invariant to this variation, then, if you can, pre-process the data to remove this variability. Change the brightness of each image so it has a fixed average pixel intensity. This normalization may not remove more complex features associated with sunlight, such as the presence of shadows, but it will help substantially.
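A sketch of that kind of normalization in Python (a simple additive adjustment; one could equally rescale):

```python
import numpy as np

def normalize_brightness(image, target_mean=0.5):
    """Shift pixel intensities so every image has the same average brightness.

    `image` is a float array with values in [0, 1]. This removes overall
    exposure differences (sunny vs. cloudy) but not shadows or other
    lighting-dependent structure; clipping can nudge the mean slightly.
    """
    adjusted = image - image.mean() + target_mean
    return np.clip(adjusted, 0.0, 1.0)
```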
One could argue that, if a deep neural net can learn features, it should learn that the presence of sunlight is not an important feature. I disagree: yes, given enough data a good model will mostly ignore such a feature, but if you have the ability to enforce invariance, do it! The feature-learning machine will do an even better job assessing the other features, rather than re-learning what you yourself already know.
Learning features can be effective, when there is a natural hierarchy, but it requires substantial data. What if you don’t have so much data? One trick is to train the base layers of a multi-layer net on available data that is not specific to your problem, so that the network “learns the features” of the data type. Then train the upper layers on your particular data.
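In the PyTorch/torchvision world, that trick looks roughly like the following sketch. The pretrained weights, the number of classes, and the choice of ResNet-18 are all placeholders for whatever your problem calls for.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Base layers trained on a large generic image data set (downloads ImageNet weights).
net = models.resnet18(weights="IMAGENET1K_V1")

for p in net.parameters():                    # freeze the feature-extracting layers
    p.requires_grad = False

num_classes = 3                               # hypothetical: classes in *your* small data set
net.fc = nn.Linear(net.fc.in_features, num_classes)   # new top layer, trained on your data

# Only the new layer's parameters are given to the optimizer.
optimizer = torch.optim.SGD(net.fc.parameters(), lr=1e-3)
```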
If no such data is available, then fitting a large multi-layer net will probably not be effective.
One can derive features from inputs manually, as was done before the advent of Large AI, and then use Small AI.
Small AI
At the other end of the AI size spectrum lie machine-learning tools that have been around for some time. Many of them are perfectly fine universal approximators, only they are not hierarchical.
Others parse coefficients into additive effects which may include “interactions”: a “main effect” is an additive contribution to a score based on a single factor, irrespective of the values of other factors. “Interaction terms” add to the score based on the joint values of two or more factors (but generally not very many), irrespective of the remaining factors. This is another strategy for “making the haystack smaller” that can yield better models if the data-generating process is well-approximated by additive effects. Again, it’s a matter of finding the right model for the data.
Popular “small-AI” machine-learning technologies include (a brief fitting sketch follows the list):
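For concreteness, a model with three main effects and one two-way interaction might look like

$$
\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_{12} X_1 X_2 ,
$$

where $\beta_1$, $\beta_2$, $\beta_3$ are main effects and the interaction term $\beta_{12} X_1 X_2$ lets the effect of $X_1$ depend on the level of $X_2$.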
Statistical regression or classification models, preferably enriched with nonlinear components and select interaction terms. Frank Harrell, Jr. recommends a workflow that avoids overfitting. Generalized additive models serve a similar role. I discussed this in an earlier blog post on modeling clinical data.
Multivariate Adaptive Regression Splines (MARS) uses an ad-hoc search approach for fitting that isn’t optimal, but its flexibility paired with explanatory power may be worthwhile in some cases.
Support vector machines (SVM) were the state of the art in about 2000. They haven’t gotten any worse.
Gradient boosting is a tree-based method that is a leading contender, especially with small data sets. Degree of interaction is specified by the number of splits at each iteration.
Random forest is another tree-based method that is rarely the top performer for prediction, but it is competitive for a broad range of data sets with very little optimization. Meanwhile it is very fast to fit and provides a raft of explanatory information; it’s less a “black box” than a “grey box”.
Gaussian process models, which have an advantage of being able to provide realistic confidence intervals around predicted values.
Artificial neural nets, much as with Large AI except smaller, with one or a few hidden layers, usually perform well among Small AI competitors but require nontrivial configuration choices.
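To make “small” concrete, here is a sketch that fits two of these methods with scikit-learn on a synthetic data set; the data and settings are for illustration only.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a modest tabular data set.
X, y = make_friedman1(n_samples=400, noise=1.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = {
    "gradient boosting": GradientBoostingRegressor(max_depth=2, random_state=1),  # shallow trees limit interaction order
    "random forest": RandomForestRegressor(random_state=1),
}
for name, m in models.items():
    m.fit(X_train, y_train)
    print(name, "test R^2:", round(m.score(X_test, y_test), 3))
```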
If you don’t need generative AI, and your application doesn’t have natural hierarchy, or you simply don’t have enough data to estimate hierarchical features, then Small AI may be the best solution. Find the right tool for the job.
Small AI has additional advantages:
Much less time is required to fit a model. This means you can get a useful result sooner.
Furthermore, the analyst would probably have time to develop a resampling-based estimate of future predictive performance, via bootstrapping or cross-validation, that is superior to that obtained from a static test set, as used with very large models. Note: I’m not saying the model’s predictive performance is better, but rather that the estimate of the model’s predictive performance is better.
It’s controversial to say that a resampling-based estimate is superior to a validation-set-based estimate. I argue that validation sets are usually too small to be very informative. At any rate, with Small AI one could have both, but with Large AI, resampling is not computationally feasible. (A small sketch of a resampling-based estimate follows this list.)
Generally Small AI models can be fitted on a laptop. No worries about hardware requirements.
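Here is what such a resampling-based estimate looks like in practice, sketched with scikit-learn’s cross-validation helper on a synthetic data set:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=400, noise=1.0, random_state=1)

# Resampling-based estimate of out-of-sample performance: every observation
# serves in a held-out fold exactly once, so all the data informs the estimate.
scores = cross_val_score(GradientBoostingRegressor(random_state=1), X, y, cv=10, scoring="r2")
print("10-fold CV R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```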
Note: even if you don’t have enough data for a deep neural net to discover features, you can engineer features yourself and add them to the list of data inputs. Feel free to create lots of them; if your modeling is well-controlled, having many inputs will not be a problem. Also, you may explore and transform the input variables as much as you like and you won’t induce bias or overoptimism, as long as your decisions are based on the inputs only, and not on how the relationship to the outcome is improved.
Which to use, Small or Large?
Certainly generative AI, which necessarily falls on the “large” end of the small/large AI continuum, can do some amazing things. Furthermore, even if an application is solely “predictive” rather than “generative”, recent developments in large models can improve predictive quality in applications where there is natural hierarchy.
It’s intuitively appealing to think that if Large AI can do some things that Small AI can’t, then, among those things that Small AI can do, surely Large AI can do them better?
However, this doesn’t logically follow. A simpler model is simpler because it conforms to some constraining assumptions. For instance, if $Y$ depends linearly on $X$, a strictly linear fit will predict better than a flexible fit. Generally, if a data-generating process conforms to some assumptions, then a model that is simpler because it also conforms to those assumptions will fit better than a more general and potentially more complex model. If the process violates the constraining assumptions, then the more general model may do better. Thus, there are many cases in which a simpler model outperforms a more-complex model. A good analyst finds the method appropriate to the data.
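A quick illustration of this point, with simulated data that really is linear in $X$ (the specific models and numbers are just for demonstration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 60                                                    # a small data set
X = rng.uniform(-1, 1, size=(n, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=1.0, size=n)   # truly linear in X

for name, model in [("linear fit", LinearRegression()),
                    ("flexible fit", RandomForestRegressor(random_state=0))]:
    mse = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error").mean()
    print(name, "cross-validated MSE:", round(mse, 2))
```

On data like this, the constrained (linear) model should show the smaller cross-validated error, despite being far less flexible.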
Again:
If the application involves a natural hierarchy and large data sets are available, Large AI may do better.
If there is a natural hierarchy and large data sets are not available, you need to be clever:
Train the base layers of a deep neural net on publicly-available large data sets that are not specific to your problem. Then fix those layers and train the upper layers on your limited data.
Develop features derived from input variables manually, then use Small AI tools.
If there is no clear natural hierarchy, use Small AI.
Am I biased?
Does my training and experience give me an axe to grind, or does it give me valuable perspective? I’ll let you judge.
I’m trained as a statistician. My graduate school training provided a basis but I didn’t take any coursework on artificial intelligence or machine learning. I began learning about machine learning and neural nets while on the job, as I saw a need for flexible predictive systems in automated diagnostic systems.
In one case, a product development team needed a continuous mapping from a three-dimensional color space to a two-dimensional state space as part of their signal acquisition system. I analyzed training data with Generalized Additive Models (GAMs) and convinced myself that the mapping involved a 3-way interaction among 3 continuous variables. Hence fitting an additive model was pointless: it would neither provide explanatory power (with a 3-way interaction among 3 factors, there is no additivity to describe) nor predict well. I fitted a single-hidden-layer neural net, and I worked with the software team to implement a neural net “evaluator” function.
The device went on the market after approval by the FDA, neural net and all. It didn’t occur to the business to put a sticker on the box saying “Now with AI!” In fact, some people were concerned that the FDA would object to the use of a neural net, or at least would ask for extra documentation. The FDA did neither.
Incidentally, for that same product, I worked with the software team to implement the SPSA optimization algorithm in order to calibrate a light source autonomously. The use of optimization in an autonomous system, which I’ve noted in several places in this essay, is a familiar idea.
As machine-learning/AI researchers began handling massive data sets with cloud-based systems and fleets of GPU’s, I didn’t have the appetite to deal with that infrastructure, especially as I was getting along with smaller systems. The data sets I was working with were small. Note that while Big Data has its challenges, Small Data has a different set of challenges.
Perhaps I’m an old curmudgeon railing at newfangled technology, and this essay is just one long rationalization. I don’t think so, but again, you be the judge. I don’t begrudge Large AI its unique achievements. On the other hand, I continue to see opportunities for Small AI. And we should all know the difference.
Summary
In short:
AI is not a homogeneous mass; there are multiple species of AI, and the chatbots and prompted generation that have caught the public imagination are only some of them.
I propose characterizing particular AI applications as
(Merely) predictive or generative
Small or Large
If the application involves a natural hierarchy and large data sets are available, Large AI is a good choice.
If there is a natural hierarchy and large data sets are not available, you need to be clever:
Train the base layers of a deep neural net on publicly-available large data sets that are not specific to your problem. Then fix those layers and train the upper layers on your limited data.
Develop features derived from input variables manually, then use Small AI tools.
Large AI has increased costs of money, electricity, and time. A wise manager will assess whether the cost is worth it. Find the right tool for the job.
If there is no clear natural hierarchy, use Small AI.
In my work as a non-clinical statistician–in a work setting (pharmaceuticals) where clinical statisticians are the most prevalent variety of statistician–I’ve heard many variations of this story:
A person seeks statistical support.
They turn to the obvious place, the statistics staff within their company, which is wholly oriented towards clinical trials.
Someone agrees to help.
Soon the person needing help is frustrated that the problem is not adequately solved, feeling that the statistician is not seeing things appropriately in one way or another. In turn, the statistician is also frustrated, or perhaps he or she feels that they’re doing a fine job (I’ve seen both).
To avoid this scenario, you need to know about the archetypes of statistical practice:
The clinical statistician
The industrial statistician
The machine-learning or algorithmic statistician
These are work areas, which call for certain work practices or mental orientations. Any specific statistician naturally adopts the practice pattern for their area of work, or they naturally migrate to the work that fits their orientation pattern. I call these “archetypes” of statistical orientation.
A person can have more than one archetype, and some may be able to adapt their work pattern to the different work at hand. But some live predominantly in one area, and would perform poorly in a different context if they don’t receive specific training.
You may be asking, “Where do data scientists fit into this system?” I first came up with this system long ago, before “data scientist” was a thing, so humor me and let me parse the statistical world first, and then I’ll make an attempt at placing data scientists into it.
A caveat: my career experience has been in the health field (diagnostic devices and, a bit, pharmaceuticals). If you’re coming from a different background, do things appear a bit different to you? I’d be interested to hear your thoughts.
With that, let me describe the three archetypes.
The clinical statistician
The clinical statistician supports clinical studies interpreted in an inherently adversarial context, whether they’re supporting a research manuscript for publication or a drug application to health authorities. They make arguments concerning the degree of evidence contained in the data for or against a hypothesis, sufficient to convince a skeptical audience. Therefore they must make sure everything is in order and all critical assumptions are met.
“Biostatistics” is the sub-field of statistics that focuses on clinical trials. There are challenges that arise uniquely in clinical trials, and statisticians who do not usually work in this area may not be that familiar with some of them.
Making an argument in an adversarial intellectual context is not unlike mathematics. Mathematics is an edifice with fundamental principles at the foundation, and each step up has to be verified so that the entire structure can be trusted as sound. In graduate school I took a required pure math course called “Real Analysis”, which is the theoretical development of calculus. For one exam we were given one hour to prove 10 statements. In that hour I finished proving three of them; for the others I wrote down some thoughts but did not fully prove the statements. In some cases I wrote what I was trying to show, and why the arguments I was imagining weren’t sufficient.
For confidently proving only 30% of the statements, while not saying anything untrue, I earned an “A” on the exam. I demonstrated that I was not extremely creative in math, yet I knew what followed from what.
In other words, a fundamental rule is, “First, say no wrong thing.” Just as in mathematics, it applies when arguing a case to regulatory authorities. In fact, if your audience finds you to be imprudent about the accuracy of your statements, they may trust you less in everything you say.
We’ll see that focusing on saying no wrong thing can be a detrimental orientation in other contexts.
The industrial statistician
The industrial statistician focuses on experiments with many controllable factors, such as in a lab experiment. Their goal is to optimize a product or process, or to develop a predictive model of a system. Here it is critical to determine the most important contributors to the process and to understand different sources of variability. Once the most important factors are identified and characterized, less-important factors are of little interest. Assessing evidence for relationships sufficient to convince a skeptical audience is not important; what is important instead is reaching findings that will move the project forward usefully, even if they’re imperfect approximations of reality.
There is a technology for handling a large number of predictor variables under experimental control, especially where every single observation is expensive (e.g., it requires an entire run of a pilot manufacturing line).
For instance, the field of “fractional factorial designs” can develop a useful local model for 5 controlled factors by collecting only 9 runs (8 runs that perturb all 5 factors and one run at the nominal “center” in order to detect whether a linear model is adequate); a 10th run, repeating the center point, would be valuable in order to estimate pure error. Collecting those two center points at the beginning and end of the experiment adds information about drift over time. You can see how the goal is to get the most possible information out of the smallest possible experiment.
This 9-run fractional factorial design assumes that interactions involving 3 factors or more are trivial, which is realistic in most cases. 9 or 10 runs stands in contrast to the $2^5 = 32$ runs one might naïvely expect as a minimal perturbation of 5 factors (not counting center points or replication). 32 runs will support estimation of all possible interactions up to 5-way, but it is highly, highly unlikely that all such interactions will be active. Finding a balanced subset of runs in 5-dimensional space is a nontrivial task, and one that exercises geometric principles. This is a useful skill, and one that a clinical statistician could very well have no awareness of.
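For the curious, here is one common construction of such a design in Python: start from a full two-level factorial in three factors and define the remaining two factors through generators. (Real DOE software would also report the aliasing structure, which tells you exactly which interactions you are implicitly assuming away.)

```python
import itertools
import numpy as np

# Full 2^3 factorial in factors A, B, C (coded -1/+1), with the remaining
# factors defined by the generators D = A*B and E = A*C.
base = np.array(list(itertools.product([-1, 1], repeat=3)))
A, B, C = base[:, 0], base[:, 1], base[:, 2]
design = np.column_stack([A, B, C, A * B, A * C])   # 8 runs, 5 factors

center = np.zeros((1, 5))                           # a center-point run
runs = np.vstack([design, center])
print(runs)   # 9 rows: 8 fractional-factorial runs plus one center point
```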
There is an analogy with numerical optimization routines. Note that such algorithms work iteratively, and at each iteration they make a local linear or quadratic approximation to the function of interest. It isn’t critical that the function be truly linear or quadratic; the approximation only needs to be good enough to move the search process forward. The same is true for modeling of experiment data: the model needs to be a good enough local approximation that it moves the project forward; it need not be correct.
Industrial work, in its best form, is usually iterative. In fact, a rule of thumb is to spend no more than 20% of your budget on your first experiment, because you fully expect subsequent experiments. Subsequent experiments can incorporate previous findings, and it is efficient not to waste resources on factors or estimates that, given prior data, can be neglected. In a sequential context, correctness is less important than in the clinical context, provided results are correct enough; after all, if the conclusions of one experiment are a little off, the next experiment will refine them.
Where the operating principle for the clinical trial statistician is “First, say no wrong thing,” the first operating principle for the industrial statistician is, “Achieve a working model that is probably inaccurate but is good enough to move the business goals forward.”
There is an art to working with a team to elicit a list of all potential factors, develop a strategy to handle them, and to bring the team on board with the experimental design. Thus there is a bit that is ineffable and social in the practice of industrial statistics.
The machine-learning or algorithmic statistician
Statistics is about learning from data, and standard statistical methods develop this learning by making assumptions that might be more useful than true. Do errors follow a Gaussian (normal) distribution? Are relationships linear? And we begin to realize that these assumptions, while convenient, are not actually required or even the main idea. We can do about as well if we assume that a relationship is smooth, rather than linear, for instance. This direction of interest leads to semiparametric modeling, multivariate clustering, semiparametric density estimation, and predictive models that allow for arbitrary complexity of interaction (neural nets, random forest, support vector machines). This flavor of statistician also takes responsibility for assessing a model’s generalizability to future data.
However, the big dichotomy in statistics is between clinical and industrial; the machine-learning orientation is usually added to one of the others, when it is expressed at all.
Data scientists and statisticians
Where is the boundary between data science and statistics? Is there one? Much ink has been spilled on this so it is probably foolish to pursue it here…but fools rush in, so….
My own experience (with clinical biomarker exploration) suggests the following:
Data scientists are intentional about the craft of programming and managing data. While a few statisticians have intentionally nurtured their programming skills as a craft, the community doesn’t see it as a universal need. A randomly-selected data scientist is likely to be a better programmer than a randomly-selected statistician. Note that a “better” programmer is not someone who knows how to do more things, but rather someone who factors the complexity of a problem into useful modules and organizes their code so that it is easy to modify or troubleshoot.
Data scientists take ownership of data pipelines and data handling, more so than statisticians.
Statisticians own the question of inference. If you’re making statistical inference, you’re doing statistics, even if you don’t consider yourself a statistician.
Data scientists tend to take more responsibility than statisticians for understanding the scientific background.
I see a bifurcation in the data science community: there are those whose analysis process is to use loops and a small set of hypothesis tests and plots, then interpret the resulting pile of output. Others adopt machine-learning and clustering methods eagerly.
What does this all mean?
In the health-related industry, most statisticians work in the clinical area, and support the clinical archetype. This is very important work with its own idiosyncrasies, and it’s good that these practitioners generally adopt the archetype appropriate to the work. Some of these practitioners may also be able to adopt one or more of the other archetypes when placed into a different context, while others may not.
The proportion of clinical statisticians among statisticians in health-related areas is so high that many in the industry don’t realize that there is anything else. They refer to all statisticians as “biostatisticians” and expect that clinical statisticians can address any statistical need. This is a big error and can lead to business and project issues.
It would be just as big of an error to place one of the other archetypes into a clinical role if they cannot adopt the clinical orientation. However, given that the clinical role is so prevalent and highly developed, this direction of error rarely happens, or at least it is caught right away. It is the clinical-to-other direction that is more likely to be undetected, and to lead to problems. All it takes is a blasé pointy-haired boss to put the wrong person into place and the stage for mayhem is set.
In essence, managers in health-related organizations need to know that not all statisticians are alike.
If you like, enter your e-mail address and you’ll receive an e-mail whenever I publish a new essay.
I won’t give your e-mail address to anyone.
You won’t receive a notification more than once per week. I can’t write faster than that! I don’t publish on a fixed schedule; you should see a notification every two or three weeks.