Statistical archetypes

Introduction

In my work as a non-clinical statistician in a setting (pharmaceuticals) where clinical statisticians are the most prevalent variety, I’ve heard many variations of this story:

  • A person seeks statistical support.
  • They turn to the obvious place, the statistics staff within their company, which is wholly oriented towards clinical trials.
  • Someone agrees to help.
  • Soon the person needing help is frustrated that the problem is not adequately solved, feeling that the statistician is not seeing things appropriately in one way or another. In turn, the statistician is also frustrated, or perhaps feels that they’re doing a fine job (I’ve seen both).

To avoid this scenario, you need to know about the archetypes of statistical practice:

  • The clinical statistician
  • The industrial statistician
  • The machine-learning or algorithmic statistician

These are work areas, each calling for certain work practices or mental orientations. Any specific statistician naturally adopts the practice pattern of their area of work, or naturally migrates to the work that fits their orientation. I call these “archetypes” of statistical orientation.

A person can have more than one archetype, and some may be able to adapt their work pattern to the different work at hand. But some live predominantly in one area, and would perform poorly in a different context without specific training.

You may be asking, “Where do data scientists fit into this system?” I first came up with this system long ago, before “data scientist” was a thing, so humor me and let me parse the statistical world first, and then I’ll make an attempt at placing data scientists into it.

A caveat: my career experience has been in the health field (diagnostic devices and, a bit, pharmaceuticals). If you’re coming from a different background, do things appear a bit different to you? I’d be interested to hear your thoughts.

With that, let me describe the three archetypes.

The clinical statistician

The clinical statistician supports clinical studies interpreted in an inherently adversarial context, whether they’re supporting a research manuscript for publication or a drug application to health authorities. They make arguments concerning the degree of evidence contained in the data for or against a hypothesis, sufficient to convince a skeptical audience. Therefore they must make sure everything is in order and all critical assumptions are met.

“Biostatistics” is the sub-field of statistics that focuses on clinical trials. There are challenges that arise uniquely in clinical trials, and statisticians who do not usually work in this area may not be that familiar with some of them.

Making an argument in an adversarial intellectual context is not unlike mathematics. Mathematics is an edifice with fundamental principles at the foundation, and each step up has to be verified so that the entire structure can be trusted as sound. In graduate school I took a required pure math course called “Real Analysis”, which is the theoretical development of calculus. For one exam we were given one hour to prove 10 statements. In that hour I finished proving three of them; for the others I wrote down some thoughts but did not fully prove the statements. In some cases I wrote what I was trying to show, and why the arguments I was imagining weren’t sufficient.

For confidently proving only 30% of the statements, while not saying anything untrue, I earned an “A” on the exam. I demonstrated that I was not extremely creative in math, yet I knew what followed from what.

In other words, a fundamental rule is, “First, say no wrong thing.” Just as in mathematics, it applies when arguing a case to regulatory authorities. In fact, if your audience finds you to be imprudent about the accuracy of your statements, they may trust you less in everything you say.

We’ll see that focusing on saying no wrong thing can be a detrimental orientation in other contexts.

The industrial statistician

The industrial statistician focuses on experiments with many controllable factors, such as in a lab experiment. Their goal is to optimize a product or process, or to develop a predictive model of a system. Here it is critical to determine the most important contributors to the process and to understand different sources of variability. Once the most important factors are identified and characterized, less-important factors are of little interest. Assessing evidence for relationships sufficient to convince a skeptical audience is not important; what is important instead is reaching findings that will move the project forward usefully, even if they’re imperfect approximations of reality.

There is a technology for handling a large number of predictor variables under experimental control, especially where every single observation is expensive (e.g., it requires an entire run of a pilot manufacturing line).

For instance, the field of “fractional factorial designs” can develop a useful local model for 5 controlled factors by collecting only 9 runs (8 runs that perturb all 5 factors and one run at the nominal “center” in order to detect whether a linear model is adequate); an additional 10th run at the center point would be valuable in order to estimate pure error. Collecting those two center points at the beginning and end of the experiment adds information about drift over time. You can see how the goal is to get the most possible information out of the smallest possible experiment.

This 9-run fractional factorial design assumes that interactions involving 3 or more factors are trivial, which is realistic in most cases. 9 or 10 runs stands in contrast to the 2^5 = 32 runs one might naïvely expect as a minimal perturbation of 5 factors (not counting center points or replication). 32 runs will support estimation of all possible interactions up to 5-way, but it is highly unlikely that all such interactions will be active. Finding a balanced subset of runs in 5-dimensional space is a nontrivial exercise, and one that draws on geometric principles. This is a useful skill, and one that a clinical statistician could very well have no awareness of.
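As a sketch of how such a design comes together, here is the 8-run fraction plus two center points in plain Python. The generators D = AB and E = AC below are one common choice for a 2^(5-2) fraction, invented here for illustration; a real design would be chosen with the specific aliasing structure in mind.

```python
# Sketch: a 2^(5-2) fractional factorial (8 runs, 5 factors in coded -1/+1
# units) plus two center points, one at the start and one at the end of the
# experiment so that drift over time can be detected.
import itertools

# Full 2^3 factorial in the base factors A, B, C
base = list(itertools.product([-1, 1], repeat=3))

design = []
for a, b, c in base:
    d = a * b   # generator: D = AB (an illustrative choice)
    e = a * c   # generator: E = AC
    design.append((a, b, c, d, e))

center = (0, 0, 0, 0, 0)            # nominal "center" run
runs = [center] + design + [center]  # 10 runs total

for run in runs:
    print(run)
```

Each factor column is balanced (equal numbers of high and low settings across the 8 factorial runs), which is what lets so few runs support a useful local model.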

There is an analogy with numerical optimization routines. Note that such algorithms work iteratively, and at each iteration they make a local linear or quadratic approximation to the function of interest. It isn’t critical that the function be truly linear or quadratic; the approximation only needs to be good enough to move the search process forward. The same is true for modeling of experiment data: the model needs to be a good enough local approximation that it moves the project forward; it need not be correct.
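To make the analogy concrete, here is a minimal Newton’s-method sketch, on an arbitrary convex function chosen purely for illustration. Each step fits a local quadratic approximation and jumps to its minimum; the quadratic is wrong everywhere except locally, yet the search converges.

```python
# Sketch: minimize f(x) = x**2 + exp(-x) by iterated local quadratic
# approximation (Newton's method). The step -f'(x)/f''(x) is exactly the
# jump to the minimum of the local quadratic model at x.
import math

def f_prime(x):
    return 2 * x - math.exp(-x)

def f_double_prime(x):
    return 2 + math.exp(-x)

x = 0.0
for _ in range(20):
    x -= f_prime(x) / f_double_prime(x)  # move to the local model's minimum

print(x)  # close to the true minimizer, where f'(x) = 0
```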

Industrial work, in its best form, is iterative. A rule of thumb is to spend no more than 20% of your budget on your first experiment, because you fully expect subsequent experiments. Subsequent experiments can incorporate previous findings, and it is efficient not to waste resources on factors or estimates that, given prior data, can be neglected. In a sequential context, correctness is less important than in the clinical context, provided results are correct enough; after all, if the conclusions of one experiment are a little off, the next experiment will refine them.

Where the operating principle for the clinical trial statistician is “First, say no wrong thing,” the first operating principle for the industrial statistician is, “Achieve a working model that is probably inaccurate but is good enough to move the business goals forward.”

There is an art to working with a team to elicit a list of all potential factors, develop a strategy to handle them, and to bring the team on board with the experimental design. Thus there is a bit that is ineffable and social in the practice of industrial statistics.

The machine-learning or algorithmic statistician

Statistics is about learning from data, and standard statistical methods develop this learning by making assumptions that might be more useful than true. Do errors follow a Gaussian (normal) distribution? Are relationships linear? And we begin to realize that these assumptions, while convenient, are not actually required or even the main idea. We can do about as well if we assume that a relationship is smooth, rather than linear, for instance. This direction of interest leads to semiparametric modeling, multivariate clustering, semiparametric density estimation, and predictive models that allow for arbitrary complexity of interaction (neural nets, random forest, support vector machines). This flavor of statistician also takes responsibility for assessing a model’s generalizability to future data.
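As an illustration of assuming “smooth” rather than “linear,” here is a minimal Nadaraya–Watson kernel smoother in plain Python; the data and bandwidth are invented for the example. It recovers a sinusoidal relationship that no straight line can fit.

```python
# Sketch: a kernel smoother assumes only smoothness, and captures a curved
# relationship that the best-fitting straight line misses.
import math, random

random.seed(1)
xs = [i / 20 for i in range(41)]                               # grid on [0, 2]
ys = [math.sin(3 * x) + random.gauss(0, 0.1) for x in xs]      # nonlinear truth

def kernel_smooth(x0, xs, ys, bandwidth=0.15):
    """Local weighted average: Gaussian kernel weights nearby points."""
    w = [math.exp(-((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

fitted = [kernel_smooth(x, xs, ys) for x in xs]

# Ordinary least-squares straight line, for comparison
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
linear = [my + slope * (x - mx) for x in xs]

# Error of each fit against the true curve sin(3x)
sse_smooth = sum((f - math.sin(3 * x)) ** 2 for f, x in zip(fitted, xs))
sse_linear = sum((l - math.sin(3 * x)) ** 2 for l, x in zip(linear, xs))
print(sse_smooth, sse_linear)  # the smoother's error is far smaller
```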

However, the big dichotomy in statistics is between clinical and industrial; the machine-learning orientation is usually added to one of the others, when it is expressed at all.

Data scientists and statisticians

Where is the boundary between data science and statistics? Is there one? Much ink has been spilled on this so it is probably foolish to pursue it here…but fools rush in, so….

My own experience (with clinical biomarker exploration) suggests the following:

  • Data scientists are intentional about the craft of programming and managing data. While a few statisticians have intentionally nurtured their programming skills as a craft, the community doesn’t see it as a universal need. A randomly-selected data scientist is likely to be a better programmer than a randomly-selected statistician. Note that a “better” programmer is not someone who knows how to do more things, but someone who factors the complexity of a problem into useful modules and organizes their code so that it is easy to modify or troubleshoot.
  • Data scientists take ownership of data pipelines and data handling, more so than statisticians.
  • Statisticians own the question of inference. If you’re making statistical inference, you’re doing statistics, even if you don’t consider yourself a statistician.
  • Data scientists tend to take more responsibility than statisticians for understanding the scientific background.
  • I see a bifurcation in the data science community: some run loops over a small set of hypothesis tests and plots, then interpret the resulting pile of output; others eagerly adopt machine-learning and clustering methods.

What does this all mean?

In the health-related industry, most statisticians work in the clinical area, and support the clinical archetype. This is very important work with its own idiosyncrasies, and it’s good that these practitioners generally adopt the archetype appropriate to the work. Some of these practitioners may also be able to adopt one or more of the other archetypes when placed into a different context, while others may not.

The proportion of clinical statisticians among statisticians in health-related areas is so high that many in the industry don’t realize that there is anything else. They refer to all statisticians as “biostatisticians” and expect that clinical statisticians can address any statistical need. This is a big error and can lead to business and project issues.

It would be just as big of an error to place one of the other archetypes into a clinical role if they cannot adopt the clinical orientation. However, given that the clinical role is so prevalent and highly developed, this direction of error rarely happens, or at least it is caught right away. It is the clinical-to-other direction that is more likely to be undetected, and to lead to problems. All it takes is a blasé pointy-haired boss to put the wrong person into place and the stage for mayhem is set.

In essence, managers in health-related organizations need to know that not all statisticians are alike.

