Blog

  • Bayesian methods: why, or when?

    Introduction

    There are many compelling theoretical reasons why Bayesian statistical methodology is superior to classical, “frequentist” methods. Under weak conditions, Bayesian estimates are “admissible”, to use a specialized decision-theoretic term. Bayesian methods base inference only on the data actually observed, rather than on the probability of data you might otherwise have seen, but didn’t (the basis for p-values). This makes the study designer’s intentions critical for calculating a p-value, which can lead to all manner of absurdities.

    This raises the question: what aspects of a problem make a Bayesian approach well-suited to it? And what is it about the Bayesian approach that makes it so? And–most challenging–can I make an argument that a non-statistician would find useful?

    I need to give a little technical background, and then I’ll offer an example.

    Meet the likelihood function

    Definition

    We observe a data-generating process governed by unknown parameters collectively named \(\theta\). We refer to the data collectively as \(x\).

    If we are considering data that we might get in the future from a process governed by particular values of \(\theta\), we are applying probability. We think about probability distributions, cumulative distribution functions, densities, etc. On the other hand, if we have already observed data and we are now wondering what we can learn about the parameters \(\theta\), we are in the statistical estimation and inference world.

    • Here’s a very powerful concept: imagine writing down a mathematical expression for the probability of the observed data \(x\) given parameters \(\theta\). This is based on prospective thinking. However, if we then consider \(x\) to be fixed, we have a function of the variable \(\theta\). This is the “likelihood function”.
      • If observations are independent, this function is a product of each individual observation’s likelihood.
      • Applying the logarithm turns multiplication into addition, so the product of contributions becomes a sum of contributions to the “log-likelihood” function.
    • If \(\theta\) are continuous, the (log-) likelihood function is a curve or surface over all possible \(\theta\) values, some getting a higher likelihood than others. We think of this as the vehicle by which data argue for certain \(\theta\) values over others.

    If likelihood gives weight to values of \(\theta\) that indicate a model that’s more or less consistent with observed data–i.e., the model fits–then the \(\theta\) value that maximizes the likelihood could be a good statistical estimate. And it usually is: this is the maximum likelihood estimate (MLE); this approach has been massively successful.
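
    To make this concrete, here is a minimal sketch (my own illustration, with made-up numbers: 7 successes in 10 trials) of a binomial log-likelihood evaluated over a grid of candidate \(p\) values; the grid value with the highest log-likelihood is the familiar MLE, 7/10.

    ## Hypothetical data: 7 successes in 10 trials
    x <- 7
    n <- 10
    
    ## Evaluate the log-likelihood over a grid of candidate p values
    pGrid <- seq(0.001, 0.999, length = 999)
    logLik <- dbinom(x, size = n, prob = pGrid, log = TRUE)
    
    plot(pGrid, logLik, type = "l",
         main = "Binomial log-likelihood",
         xlab = "p", ylab = "Log-likelihood")
    
    ## The maximizing grid value is the MLE, x / n = 0.7
    pGrid[which.max(logLik)]

    The shape of this curve near its peak is exactly the kind of information discussed next.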

    Likelihood surface and weight of evidence

    You might also ask whether there are other \(\theta\) values that give likelihood close to the maximum. Is the likelihood surface unimodal? If so, how quickly do values drop off as we move away from the maximizing point? This provides information about the precision of the estimate.

    In fact, you could imagine developing an entire framework for assessing evidence and making estimates using only likelihood and eschewing p-values. Indeed, Richard Royall has done just this.

    Like Royall’s construction, Bayesian methodology also builds an entire inference framework on likelihood, but with an additional ingredient: a “prior” probability distribution reflecting the state of knowledge about \(\theta\) prior to seeing the recent data. This is not a sampling-based probability, or rate; it is probability representing subjective belief.

    Bayesian estimate

    Note that we have two objects that span the space of all possible \(\theta\) values and provide a weighting of such values: the prior and the likelihood. The Bayesian approach proceeds by taking the product of the two; the result is the “posterior” distribution. If a proper probability distribution (for \(\theta\)) is needed, we only need to find the normalizing denominator that makes this product integrate to 1.0. Often this is not required, and in fact in most non-trivial situations we come to understand this distribution by sampling from it.
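
    A minimal grid sketch of this product (my own illustration, with made-up numbers: 7 successes in 10 trials, and a Beta(2, 2) prior chosen purely for illustration):

    ## Prior and likelihood evaluated over a grid of p values
    pGrid <- seq(0.001, 0.999, length = 999)
    prior <- dbeta(pGrid, 2, 2)              # illustrative prior
    lik   <- dbinom(7, size = 10, prob = pGrid)
    
    ## The posterior is proportional to prior times likelihood;
    ## dividing by the sum over the grid supplies the normalizing constant
    posterior <- prior * lik
    posterior <- posterior / sum(posterior)
    
    plot(pGrid, posterior, type = "l",
         main = "Grid approximation of the posterior",
         xlab = "p", ylab = "Posterior weight")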

    There is another very interesting wrinkle: for a given value of \(\theta\), we have a sampling probability distribution for future \(x\). Now, instead of just one \(\theta\), we have a weighting over all possible \(\theta\), in light of past data \(x\). We can develop a “predictive” distribution for future data, based on past data, by doing the following (sketched in code after the list):

    • Generate random sampling draws of \(\theta\) values from the posterior distribution;
    • Given the \(\theta\) values, generate random draws of \(x\) values.
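
    Here is a minimal sketch of those two steps, again with made-up numbers: a Beta(8, 4) posterior for a binomial rate and 20 future trials.

    ## Step 1: draws of the parameter from a (hypothetical) Beta(8, 4) posterior
    thetaDraws <- rbeta(5000, 8, 4)
    
    ## Step 2: for each parameter draw, a draw of future data
    futureX <- rbinom(5000, size = 20, prob = thetaDraws)
    
    ## The draws of futureX approximate the predictive distribution
    hist(futureX, breaks = seq(-0.5, 20.5, by = 1),
         main = "Predictive distribution (sketch)",
         xlab = "Future number of successes")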

    What makes the Bayesian approach work well?

    What features of Bayesian methodology make it work well, and for what sort of problems? In my experience, they are these:

    • Bayesian estimates (and inference) are stable and usually well-behaved in situations where frequentist estimates are not.
      • This is due to “shrinkage” estimation, which generally helps more than it hurts.
    • Inference in complicated scenarios is usually not hard. This is especially valuable when there are many parameters, and we’re entertaining hypotheses about relationships among them. Practically, most Bayesian applications nowadays wind up with samples from the joint posterior distribution, and complicated assessments become a matter of counting, or density estimation.
    • The predictive distribution is powerful for prospective thinking; the frequentist world has nothing quite like it, though it tries.
      • A specification interval or normal range based on a Bayesian predictive distribution is more reasonable and intuitively appealing than a tolerance interval–but that’s a topic for another day.

    Shrinkage, a hedge against gullibility

    A note on stability: While maximum likelihood estimation usually works well, it can exhibit a “gullible” quality with sparse data or small data sets. Take a pathological example: estimate a binomial rate \(p\) based on only one observed data point \(x\); suppose the observed \(x\)
    is a success. No reasonable person would give credence to a parameter estimate based on only one observation, but the MLE mathematically exists, and its value is 1.0.

    A Bayesian posterior mean also exists, and depends on the prior distribution; if we adopt a uniform prior from zero to 1.0–intuitively appealing, and also analytically tractable–we find that the posterior is the Beta(2, 1) distribution, which has mean 2/3.

    A figure shows a straight line extending from (0, 0) to (1, 2).
    Posterior distribution for binomial \(p\) when the prior distribution is uniform from 0 to 1 and we observe one success in one trial.

    We still shouldn’t give much credence to this value, but note that the methodology is “hedging”, pulling the estimate towards the prior mean, 0.5. The figure above indicates that the single a posteriori most likely value for \(p\) is 1.0, yet the posterior distribution remains quite diffuse, hence the expected value of the distribution is substantially lower than 1.0. We refer to this as “shrinkage”, and it is inherent in calculating a weighted average over all possible values of \(\theta\) (in this case all values of \(p\)) rather than finding a single maximizing value.

    In fact, the lower tail containing 5% probability can be obtained from a simple probability lookup from the beta distribution, and it is 0.22. Hence a Bayesian analog to a 95% confidence interval–a 95% credible interval–is (0.22, 1.0), appropriately very wide.
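
    These numbers are easy to check in R:

    ## Posterior after one success with a uniform prior: Beta(2, 1)
    2 / (2 + 1)        # posterior mean = 2/3
    qbeta(0.05, 2, 1)  # 5th percentile, approximately 0.22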

    Furthermore, we can structure how the shrinkage happens, if not its amount. Do we shrink towards a null value, such as zero (for model coefficients) or 0.5 (for probabilities)? If we have multiple parameters, do they all shrink towards a null value, or towards a central value that is not determined a priori? The latter can be accomplished with a hierarchical prior, where parameters of the prior distribution are drawn from a “parent” distribution.

    An example

    Years ago I was pulled into a project to develop QC specifications for batches of incoming material. Since the project was new, there were not many batches available, fewer than 20. Moreover, the manufacturer had changed their process midway through the project, and it was not known whether this change had influenced the subsequent batches. We needed specifications on roughly 20 metrics, so the analytical process had to be widely applicable.

    For each metric, we have a mean \(\mu_2\) pertaining to batches after the change, and \(\mu_1\) pertaining to before the change. Naturally, for projecting into the future, we’re interested in \(\mu_2\). It’s possible that \(\mu_2 = \mu_1\). Similarly, we have \(\sigma_2\) and \(\sigma_1\), and it’s possible that \(\sigma_2 = \sigma_1\). We deemed it unlikely that variance would change appreciably, and elected to assume a constant \(\sigma\) barring strong evidence to the contrary, although we could have set up a shrinkage framework for two variances. In either case, the pre-change batches provide useful information about variability, and so are not discarded.

    Note that if we elected to allow different variances, the frequentist would probably cast out the pre-change data as irrelevant to predicting future data, leaving a painfully small data set with which to set important specifications. A Bayesian can adopt a hierarchical prior in which the two variances are shrunk towards a common value, by a degree that is determined in light of the observed data. This isn’t entirely fair to frequentists; it is possible to contrive some sort of shrinkage estimation without using Bayesian methods. But determining the degree of shrinkage–and rationalizing that choice–remains a weak link.

    I set up a Bayesian ANOVA framework with common variance but a hierarchical prior on the means, to allow the pre-change mean to differ from the post-change mean by a degree determined by the data. I then drew samples from the predictive distribution for the post-change group, and calculated quantiles from this sample.
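
    A minimal sketch of that kind of model–using the rstanarm package and simulated stand-in data, not the actual project code or data–might look like the following.

    library(rstanarm)
    
    ## Simulated stand-in for the batch data: a metric y measured on
    ## pre-change and post-change batches (all numbers made up)
    set.seed(1)
    dat <- data.frame(
       period = rep(c("pre", "post"), times = c(10, 8)),
       y = c(rnorm(10, mean = 100, sd = 3),
             rnorm(8, mean = 101, sd = 3)))
    
    ## Common residual variance; a hierarchical (random-intercept) prior
    ## on the period means, so the pre- and post-change means are shrunk
    ## towards a common value by a degree determined by the data
    fit <- stan_lmer(y ~ (1 | period), data = dat)
    
    ## Draws from the predictive distribution for future post-change batches
    predDraws <- posterior_predict(fit, newdata = data.frame(period = "post"))
    
    ## Candidate specification limits from predictive quantiles
    quantile(predDraws, probs = c(0.005, 0.995))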

    Some observations from the example

    A few comments:

    • By allowing means to differ, I added a parameter.
      • For a frequentist, this would mean adding another degree of freedom to a search process, potentially increasing gullibility and raising the possibility of overfitting.
        • It also requires a more complex workflow: one must decide whether \(\mu_2 = \mu_1\), then act accordingly. The statistical reliability of the result is now a bit compromised because the final inference doesn’t account for this decision-making step.
      • Because the Bayesian estimate is a weighted average over all possible parameter values, the consequence of a more-complex model is not overfitting but rather a more diffuse posterior distribution with increased shrinkage of parameter estimates. We don’t get something for nothing, but with very small data sets, Bayesian estimates “fail safer” than frequentist methods.
        • In some cases analysts applying Bayesian methods to fit single-hidden-layer neural nets do not carry out the cross-validation that is de rigueur for standard machine learning. If you’re an analyst, let that sink in. Bayesian neural nets do not require smoothing parameters to be carefully optimized, and they do not overfit, even with many (or too many) parameters.
    • The case above, involving a production change which may or may not induce process change, is an example of ancillary factors complicating primary questions of interest.
    • This comes up often and is an enormous time suck on our economy, especially when the analysis is going to be reviewed by regulatory authorities. The general pattern is this: We want to know if \(A\) is equivalent to \(B\), but we have other factors \(C\) at work (production change, different reagent lots, different labs, etc.) which may or may not have an impact. With frequentist methods, we first make a decision concerning whether to use a large model (accommodating differences in \(C\)) or a simpler model (assuming no effect of \(C\)). Sometimes the conclusion depends on the model, and then what?
    • My proposal: if in doubt, use a hierarchical Bayesian model that allows for ancillary differences in a way that induces data-dependent shrinkage, and then set specifications so that the predictive distribution places 99% of future cases within them.
    • This is an example of a computationally complex application that yields conceptual simplicity and a simpler analysis workflow.

    Some cautionary notes

    Hopefully this essay provides some intuition into what Bayesian statistics does well, and what sorts of problems are amenable. But in fairness, I should also note some challenging aspects:

    • Bayesian methods use the likelihood function. Therefore a specific probability model is absolutely required. If that model doesn’t fit, the system can’t be expected to work well.
    • Similarly, Bayesian methods don’t make any provision for handling outliers. To do so, the model must be expanded explicitly. There have been some very effective efforts in this direction but they have not become widely used.
      • Therefore you’ll need to check data for outliers. If you find them and they are potentially influential, you’ll need to exclude them or winsorize them so they don’t corrupt the model. And then discuss the exclusion and its implications.
    • The problems that benefit the most from shrinkage are exactly the problems that are challenging for the sampling methods that Bayesians have been using for years. Shrinkage helps when there are many parameters, which are collectively important but for which no one individual plays a distinctly strong role. In this case, the likelihood indicates correlation among the parameters. Fortunately sampling technology has advanced, such as with Stan, though Stan allows only continuous parameters.
    • Bayesian methods require a prior distribution. What if you perform a sensitivity analysis and it makes a difference to the conclusion?

    The last point deserves a little discussion. This is perhaps the criticism of Bayesian methods that is most common: subjectivity plays a role.

    Practically, the predominant practice currently is to use very vague prior distributions, so that when the data set is large enough, details of the prior distribution become practically irrelevant. Having the structure of the Bayesian approach can still provide value, particularly with a hierarchical prior that imposes the desired form of shrinkage.

    What if one has a critical question, yet is limited to a small data set, and the prior makes a difference? Then you don’t have enough data for the data alone to determine the conclusion. On the positive side, at least you know this is the case. If you must make a decision, then know that your expert opinion is going to play a role, so get the best experts you can and elicit their opinion carefully (i.e., fit a prior).

    If a prior sensitivity analysis demonstrates that the data are not decisive, while a frequentist analysis indicates that they are, which should you believe? I think using the Bayesian approach in this case would lead to better long-term decisions.

    Summary

    If you’re an analyst or someone working with an analyst, your problem could be particularly amenable to a Bayesian approach if it has any of the following characteristics:

    • You have limited data, but the inference questions are critical.
    • You have many parameters, and you’re particularly interested in only some of them.
    • You’re particularly interested in future data, such as specifying a normal range or a specification range.


  • Analyze continuous data!

    Introduction

    Over and over I’ve found my data analysis strategies to be contrary to those of many of my peers. One area of differing inclination is that, when given continuous data, I try very hard to analyze it as continuous, rather than binning it. My preference is, according to standard principles, the theoretically preferred approach: binning discards information, and losing information must reduce statistical efficiency. Nevertheless, this textbook advice is almost universally ignored.

    I’m going to argue here that binning is a bad practice, but not for statistical efficiency reasons. Instead, I argue for trying hard to analyze continuous data on the grounds of clarity. The more we meet Nature where she is, the more clearly we can understand her and make reasonable decisions.

    Apologies: shortly after saying this blog is a why-to, not a how-to, I’m including code. It’s optional, so skip it if you like. It’s there because I initially put this blog post on a different platform that, I realized, doesn’t allow images, so I included the code so people could create their own images. Now that I can provide images, it seems a shame to take the code away.

    library(rms)
    library(mgcv)
    
    ## A crude hand-sketched example
    template <-
       data.frame(x = c(0, 4, 5, 6, 7, 8,    9,  10),
                  y = c(0, 0, 1, 3, 4, 4.25, 4.5, 4.75))
    
    ## A function to linearly interpolate
    tempFun <- 
       approxfun(x = template$x,
                 y = template$y)
    
    ## Generate simulated biomarker values
    set.seed(123)
    datbin <- data.frame(x = runif(300, 0, 10))
    
    ## Transform template to probability scale, 
    ## ranging from 0.1 to 0.8
    invLogit <- function(r) 1 - 1 / (1 + exp(r))
    
    LogitLow <- log(0.1 / 0.9) # -2.197225
    LogitHigh <- log(0.8 / 0.2) # 1.386294
    
    datbin$p <-
       invLogit((LogitHigh - LogitLow) / tempFun(10) * tempFun(datbin$x) + LogitLow)
    
    MedianBin <- median(datbin$x) # 4.778207
    
    datbin$XGp <-
       factor(ifelse(datbin$x >= MedianBin,
                     "High", "Low"),
              levels = c("Low", "High"))
    
    set.seed(456)
    datbin$YBin <-
       rbinom(nrow(datbin), 
              size = 1,
              prob = datbin$p)

    Why binning is so seductive

    As mentioned, virtually any academic statistician will argue against binning continuous data on efficiency grounds, and yet the practice is pervasive. Why? I don’t know for sure, but my guesses are as follows.

    First, our customers ask for it. Very often a statistician’s customer is a medical clinical researcher, and many clinical researchers are accustomed to seeing a comparison of discrete groups. In fact, many clinical researchers aren’t even aware that any analysis other than comparing groups is possible if the response is not continuous. For instance, if the response variable is binomial with a hypothesized probability p that depends on independent study factors, a non-statistician researcher may imagine that a probability is a property of a group, i.e., a proportion of the whole. To evaluate the probability of an outcome given a study factor, we must then define a subgroup based on the factor and count successful outcomes per subgroup. Modelers know that logistic regression can describe the probability p as changing continuously as a function of other factors: it’s possible for every single study subject to have their own distinct probability p.

    Therefore a clinical researcher asks for a subgroup definition because they prefer to think that way or they’re unaware of alternatives. Next, the pliable statistical analyst may want to please his customer by fulfilling the request as quickly and directly as possible. This raises questions about what a statistical analyst’s ideal role in collaboration is, but that’s for another day….

    Second, binning can simplify analysis. Continuous data cannot be depended upon to be nicely normally distributed as in textbook cases. There can be non-normal distributions, skewness, outliers, etc. When relating to other variables, we similarly cannot rely on relationships being linear. All of these problems simply go away when we bin. Deciding not to bin is not a simple choice, but is rather a commitment to cultivate an entirely new toolbox of analysis strategies suitable for messy cases.

    With an analyst caught between customers demanding binning on the one hand, and maybe not fully confident about their continuous-data toolbox on the other hand, perhaps it’s no surprise that most analysts simply provide the customer with what they ask for.

    Still, I challenge statistical analysts to start nurturing that toolbox. If you want to know how, watch this space.

    Statistical efficiency

    Let’s consider the question of statistical efficiency and power with a simple hypothetical example. Suppose we are assessing a possible biomarker in a clinical trial, and the biomarker is assayed on 300 subjects. Biomarker values are uniformly distributed over a range. Suppose the clinical outcome is binary. An increased biomarker value contributes to an increased probability of clinical response, and the true relationship is per the piecewise-linear curve generated as follows:

    plot(0:10,
         invLogit((LogitHigh - LogitLow) / tempFun(10) * tempFun(0:10) + LogitLow),
         type = "l",
         ylim = c(0, 1),
         main = "True probability of response",
         xlab = "Biomarker value",
         ylab = "Probability of response")
    A piecewise-linear curve that is somewhat sigmoidal, starting from a probability of 0.1 (where it remains for some time), then increasing rapidly to roughly 0.7, and then continuing to increase at a slower rate. Over the range of the curve (from 0 to 10), it reaches a maximum probability of 0.8.
    A hypothetical true clinical response for given biomarker values, for simulation purposes.

    Even though the curve is crudely piecewise linear, it has some features that complicate real-world analysis:

    • It is not linear, not even on logistic scale.
    • The probability of clinical response does not reach zero at the low end of the biomarker range, nor does it reach 1.0 at the high end. The biomarker is informative but there are clinical exceptions.
    • At the high end, increasing biomarker value contributes to increased probability, but at a slower rate.

    We want to assess whether the biomarker is worth investigating further, and if so, the nature of the relationship. I’ll carry out three alternative strategies:

    1. Calculate the median value of the biomarker and split cases into biomarker “High” and “Low” groups accordingly. Apply Fisher’s Exact Test to test for association between binary biomarker group and binary clinical outcome.
    2. Fit a generalized linear model relating binary outcome to a smoothing spline estimate of the biomarker’s contribution. Use an approximate test against the null model of no relationship. Optionally, we can assess evidence for non-linearity. Obtain an estimate of the underlying curve, with pointwise confidence intervals.
    3. Similarly, fit a logistic regression model, but use a natural spline expansion for continuous biomarker values. Obtain a likelihood-ratio test against the null of no relationship, and as with the GAM, assess the evidence for non-linearity and obtain an estimate of the curve mediating the relationship.

    Should we include fitting a continuous model that assumes linearity? I think not, because we don’t know if that’s the case, and a non-linear relationship is quite plausible.

    With alternative (1), Fisher’s Exact Test gives a p-value “< 2.2e-16”. Statistical efficiency is not an issue here.

    (The following code evaluates Fisher’s Exact Test with the X grouping against the binary outcome.)

    fisher.test(datbin$XGp, datbin$YBin)

    With the GAM–alternative (2)–we also find <2e-16 as a p-value against the null hypothesis of no relationship. The following code generates an estimate of this relationship:

    GamMod.init <- 
       gam(YBin ~ s(x),
           family = binomial, 
           data = datbin)
    
    summary(GamMod.init)
    
    plot(GamMod.init,
         trans = invLogit,
         ylim = c(0, 1),
         main = "Logit outcome vs biomarker",
         xlab = "Biomarker",
         ylab = "Logit of outcome probability")
    An estimated curve that tracks the previous figure recognizably but is implausibly wiggly.  Confidence bands lie above and below the curve.

    This tracks the true curve but is too wiggly to be plausible. Manually forcing the smoothing parameter to be large enough so that the curve is almost monotonic, we find:

    GamMod <- 
       gam(YBin ~ s(x, sp = 0.05), 
           family = binomial, 
           data = datbin)
    
    plot(GamMod,
         trans = invLogit,
         main = "Logit outcome vs biomarker",
         xlab = "Biomarker",
         ylab = "Logit of outcome probability")
    A curve less wiggly than the previous one; it’s plausible and tracks the true curve well, although there is a bit of curvature at the left end that is not present in the true curve.
    A GAM fit that tracks the true curve reasonably well.

    The predicted probabilities are actually not very different between these two fits, and they both give p-values of <2e-16 against the null model.

    ## initial model
    summary(GamMod.init)
    
    Family: binomial 
    Link function: logit 
    
    Formula:
    YBin ~ s(x)
    
    Parametric coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -0.8180     0.1548  -5.284 1.26e-07 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Approximate significance of smooth terms:
           edf Ref.df Chi.sq p-value    
    s(x) 7.433  8.393   72.3  <2e-16 ***
    
    ## Add a linear component to the initial model so that 
    ## linear and non-linear components can be assessed
    ## separately.
    GamMod.init.lin <- 
       gam(YBin ~ x + s(x),
           family = binomial,
           data = datbin)
    
    summary(GamMod.init.lin)
    
    Parametric coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  0.00000    0.00000     NaN      NaN    
    x           -0.16372    0.03098  -5.284 1.26e-07 ***
    
    Approximate significance of smooth terms:
           edf Ref.df Chi.sq p-value    
    s(x) 7.433  8.393  82.88  <2e-16 ***
    
    ## smoother model
    summary(GamMod)
    
    Parametric coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -0.8062     0.1509  -5.344 9.08e-08 ***
    
    Approximate significance of smooth terms:
           edf Ref.df Chi.sq p-value    
    s(x) 4.212  5.192  69.73  <2e-16 ***

    The initial and the smoother GAM models both show a p-value for the relationship between X and outcome of <2e-16. Additionally, a linear term is added to the initial model in order to assess the non-linear contribution separately. This shows that p-values for the linear and the non-linear contributions are both very small.

    Applying alternative (3), unpenalized spline regression, we obtain the following curve:

    ## Use rms package to enable nice ANOVA
    RegModBin <- 
       lrm(YBin ~ rcs(x, parms = 5), 
           data = datbin)
    
    ## Use base or "stock" glm to support likelihood ratio
    ## test via base anova function
    
    RegModBin.s <- glm(YBin ~ rcs(x, parms = 5), data = datbin)
    
    ## Plot
    ## Set of points on X axis for plotting
    TmpSeq <- seq(0, 10, length = 200)
    
    ## Get predictions on logistic scale, then calculate
    ## confidence limits on that scale, then transform
    ## to probability scale
    Preds <-
       predict(RegModBin,
               newdata = data.frame(x = TmpSeq),
               type = "lp", 
               se.fit = T)
    ## has components "linear.predictors" "se.fit"
    
    CIReg <- 
       data.frame(Est = invLogit(Preds$linear.predictors),
                  Low = 
                    invLogit(qnorm(0.025,
                                   mean = 
                                   Preds$linear.predictors,
                                   sd = Preds$se.fit)),
                  High = 
                    invLogit(qnorm(0.975,
                                   mean = 
                                   Preds$linear.predictors,
                                   sd = Preds$se.fit)))
    
    plot(range(TmpSeq), c(0, 1), type = "n",
         main = "Outcome probability vs. biomarker",
         xlab = "Biomarker",
         ylab = "Outcome probability")
    polygon(x = c(TmpSeq, rev(TmpSeq), TmpSeq[1]),
    	y = c(CIReg$Low, rev(CIReg$High), 
                  CIReg$Low[1]),
    	col = "thistle", border = NA)
    lines(TmpSeq, CIReg$Est)
    rug(datbin$x[datbin$YBin == 0])
    rug(datbin$x[datbin$YBin == 1], side = 3)
    A curve strongly resembling the previous, with confidence bands given in thistle color.
    A regression spline fit using a restricted cubic spline basis; no penalization is applied here.

    In this plot the Y-axis is on the probability scale rather than the logit scale. This is substantially equivalent to either GAM model. Here we can give an informative ANOVA breakdown:

    ## Generate the ANOVA table
    anova(RegModBin)
    
    Factor     Chi-Square d.f. P     
    x          69.85      4    <.0001
     Nonlinear  9.89      3    0.0195
    TOTAL      69.85      4    <.0001
    
    ## Likelihood ratio test using base R
    anova(RegModBin.s, glm(YBin ~ 1, data = datbin))
    
    Analysis of Deviance Table
    
    Model 1: YBin ~ rcs(x, parms = 5)
    Model 2: YBin ~ 1
      Resid. Df Resid. Dev Df Deviance      F    Pr(>F)    
    1       295     48.948                                 
    2       299     68.250 -4  -19.302 29.083 < 2.2e-16 ***

    This indicates that there is overwhelming evidence that the biomarker influences the outcome, and furthermore there is strong evidence of a nonlinear component, i.e., a departure from linearity on the logit scale. While this representation doesn’t show exactly how small the p-value is, a standard likelihood-ratio test yields <2e-16, just as the other methods.
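
    For reference, the exact model likelihood-ratio statistic, its degrees of freedom, and its p-value are also stored in the lrm fit, so a quick check is:

    ## Model likelihood-ratio chi-square, degrees of freedom, and p-value
    RegModBin$stats[c("Model L.R.", "d.f.", "P")]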

    In summary, all three approaches indicate strong evidence that the biomarker influences the clinical response. This does not support the idea that the continuous approach is more powerful. The median split represents the biomarker with 1 degree of freedom, while the continuous approaches use roughly 4 degrees of freedom. They yield more information, but “cost” more. It’s a fair trade, but it’s not clear that one always has more power than the other.

    Rather, the reason that I recommend a continuous approach is that, in one step, we (1) assess evidence for a non-null relationship and (2) gain a reasonable estimate of that relationship. Further, we do the latter without carrying out substantial optimization or multiple looks at the response, which compromises statistical reliability.

    Now let’s think a little more about real life.

    Representing nature

    Here’s a true story illustrating how failure to look at continuous data can lead to self-imposed confusion and obfuscation, cost real money, and delay important projects.

    As the resident expert in clinical diagnostic assays in a large pharmaceutical company (that’s not saying a great deal when the company didn’t nurture such expertise), I was pulled into an apparent assay issue impinging on an oncology clinical trial. The assay measured gene copy number (GCN) for a specific gene; a GCN value above a specified cutoff was a study enrollment criterion. That is, it was a companion diagnostic assay (CDx) for the therapy under study. Two labs carried out testing for the study, each serving a different geographic region.

    Recently there had been some operational issues with the assay which had required troubleshooting; the assay vendor had confirmed the issue, put a fix in place, and, for good measure, both labs repeated operator proficiency validations. Then patient screening for the study resumed. After some time, however, the trial team noticed that the “prevalence” (incidence of GCN over cutoff) was higher at one lab than the other. It was decided to pause study enrollment once again.

    Note that for a pharmaceutical company, completing trials quickly is the coin of the realm; pausing a trial was a Very Serious Matter. Meetings were held, numbers were compiled, and still bigger meetings were held. Finally this expanding process grew to include a bystander previously unaware of the entire study, i.e., me.

    When I joined my first meeting, it was chaired by the head of Oncology, which, for the company organization at the time, reported to the CEO. Lots of highly-paid senior people were there; this was one of those meetings where the cash clock ticked quickly.

    I was given the information that had been compiled up to then. This included assay positivity counts at each site. I asked, “Where are the GCN numbers for each of the sites? What does the GCN distribution look like at each site?”

    Such information had not been compiled!

    Consider for a moment what this indicates about priorities and corporate culture. The company was expending significant resources, pausing a trial and using a top executive’s time. It would have been the simplest thing to organize GCN values–they were available in the clinical database, waiting to be visited. There’s no question that the values actually measured would give a more complete representation than the processed values. Yet this value was not widely shared. If this describes your organization too, you have work to do!

    As an aside, there’s another aspect to this: when you’re troubleshooting a data-related issue, investigate the data process from first information acquisition to final result before you invest time in a lot of other approaches. At what point does the data begin to look anomalous? Or, if you prefer, start at the end and work towards the beginning. The point is to be systematic and to “scan” the whole process to come to an understanding of the state of things. Looking at continuous data often means looking upstream; I also have an expensive war story about this. The error is committed again and again.

    But back to our story: in not much time the team pointed me to GCN numbers and I was able to determine that the clinical cutoff for GCN was very near the median GCN value. This is not necessarily an error, but it is definitely problematic: a small shift in the distribution of measured values, such as can easily happen with many assays, will induce a substantial change in positivity rate. I fitted density estimates; they had similar shapes but one was shifted slightly higher than the other. If I shifted the higher density down by the difference in medians, the densities lined up quite well, and furthermore the apparent positivity rates closely agreed.
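
    A minimal sketch of that comparison, with simulated stand-in data rather than the actual GCN values:

    ## Simulated stand-in for GCN measurements at the two labs
    set.seed(2)
    gcn <- data.frame(
       lab = rep(c("Lab1", "Lab2"), each = 150),
       GCN = c(rlnorm(150, meanlog = log(4.0), sdlog = 0.4),
               rlnorm(150, meanlog = log(4.2), sdlog = 0.4)))
    
    ## Density estimates by lab
    d1 <- density(gcn$GCN[gcn$lab == "Lab1"])
    d2 <- density(gcn$GCN[gcn$lab == "Lab2"])
    
    plot(d1, main = "GCN distributions by lab", xlab = "GCN")
    lines(d2, lty = 2)
    
    ## Shift the higher lab down by the difference in medians and overlay
    shift <- median(gcn$GCN[gcn$lab == "Lab2"]) -
             median(gcn$GCN[gcn$lab == "Lab1"])
    lines(density(gcn$GCN[gcn$lab == "Lab2"] - shift), lty = 3)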

    This difference in median was smaller than the noise in the assay. Thus, an assay scientist wouldn’t worry too much about it. She certainly would recommend against placing the assay in a context where a trivial change (relative to assay variability) would be interpreted gravely. When the clinical cutoff was suggested, red flags should have flown.

    The team decided to carry out a paired sample study. Regulatory requirements prevented either lab from sending clinical samples to the other lab, but the assay vendor could split samples, test them, and send them to each lab. Then we could compare each lab to the vendor. Long story short, it turned out there was a small difference between the vendor and both labs, and in the same direction, but this difference was not meaningful. The trial resumed. Frankly, while there was an observed difference, when interpreted with quantitative data and an understanding of variability, there was no real issue. By looking at processed data and not turning quickly to the underlying quantities when questions arose, the team had gotten themselves into a very costly tizzy.

    Here’s what I’ve taken away from these experiences:

    • Model continuous data when possible. It’s probably closer to the data that Nature gave you than binned or otherwise processed data.
    • Transforming such data towards a Gaussian distribution is fine; in fact it’s often beneficial.
    • Analyzing data close to what Nature gave you will give you a more complete assessment. It may have some analysis complications, but it will give the decision-making team more confidence.

    For more on this topic, check out Frank Harrell Jr.’s essays How to Do Bad Biomarker Research and a chapter on Information Loss.


