What is the evidence for reliability and validity of behavior evaluations for shelter dogs? A prequel to “No better than flipping a coin.”

To best understand this article in the context of the behavior evaluation literature, please see National Canine Research Council’s complete analysis here.

Article citation:

Patronek, G. J., Bradley, J., & Arps, E. (2019). What is the evidence for reliability and validity of behavior evaluations for shelter dogs? A prequel to “No better than flipping a coin.” Journal of Veterinary Behavior, 31, 43-58. doi:

National Canine Research Council Summary and Analysis:

This paper is included because it comprehensively evaluates every study to date that has made validity, and/or reliability, and/or predictive ability claims about animal shelter canine behavior evaluations. The authors aggregate all of the relevant existing data, and summarize the (lack of) scientific support for canine behavior evaluations for dogs in shelters.

As a follow-up to the 2016 paper, “No Better Than Flipping a Coin: Reconsidering Canine Behavior Evaluations in Animal Shelters,” Patronek, Bradley, and Arps* (2019) responded to questions and frequent misunderstandings regarding the validity and reliability of existing canine behavior evaluations for dogs in shelters. While “No Better Than Flipping a Coin” (2016) took a hypothetical approach and walked through best-case and worst-case scenarios, the 2019 “Prequel” investigates and reports on existing, published canine behavior evaluations that are now used, have previously been used, or are intended for use in animal shelters both nationally and internationally. The report is both thorough and precise: it details exactly what each study reported and pays particular attention to important semantic details that may lead to confusion about an evaluation’s overall “validity.” For an additional look at many of the individual studies discussed in the Prequel, please see National Canine Research Council’s literature review, which also links to summaries and analyses of the included papers.

The primary focus of this investigative paper is to address the question, “I thought it had been shown that a particular behavior evaluation has been validated (or could reliably predict future behavior)?” To do so, the authors systematically review the published literature, summarize instances where validity, reliability, or predictive ability has been reported, and assess the strength of the claims made. Furthermore, Patronek et al. (2019) contextualize the reports and explain the type of validity that has been reported, as well as its practical or clinical importance.

There were 17 peer-reviewed publications included in the analysis, selected under liberal criteria: 1) the publication reported on a behavior evaluation that was used or intended for use on dogs living in a shelter, and 2) the publication reported on some aspect of validity, reliability, or predictive ability of the behavior evaluation.

The results, along with the authors’ assessment of whether each tested criterion was established with strong correlation and statistical significance, are summarized in Figure 1 of the article, and explanations for those assessments are shown in Table 1 of the article.

One by one, this review addresses each of these five issues:

  1. Colloquial versus scientific terminology
  2. Limitations of correlation and regression
  3. Predictive validity versus predictive ability
  4. Statistical versus clinical significance
  5. Establishing overall validity

To begin to address the question of interest, the authors first “…illustrate the scope of research needed to satisfy the scientific standard of test validation,” and “focus on attempts to measure reliability and construct validity, each a necessary but not sufficient step for demonstration of overall validity.”

They explain that in order for an instrument to be “valid,” reliability must first be established. There are several types of reliability that should be evaluated with respect to canine behavior evaluations and those include inter-rater reliability, inter-shelter reliability, and test-retest reliability.

Only five studies reported inter-rater reliability, and of the five, only three demonstrated acceptable inter-rater reliability. No study reported inter-shelter reliability. Six studies attempted to establish test-retest reliability (repeatability), but only two were able to do so to any extent. The authors discuss in more detail the shortcomings of these studies and why the evidence for test-retest reliability is weak, including a homogeneous test sample, subtests without practical significance (e.g., play behavior during tug-of-war was repeatable), and small sample sizes for the second iteration of a test (e.g., 19 dogs from an original sample of 73).

In addition to specific examples from the literature, Patronek et al. (2019) explain theoretical issues with establishing test-retest reliability that are specific to canine behavior evaluations. For example, a dog’s learning history and/or experience in the shelter between tests 1 and 2 may affect dogs’ behaviors on retest. Similarly, if test-retest reliability were established from Time 1 to Time 2 in the shelter, that does not mean that behavior would then be consistent or predictable in a different (e.g., home) environment. Moreover, they note that there is not an established period of time between tests that should be used to establish repeatability.

With regard to reliability, the authors conclude “that demonstrated reliability of any type is largely absent in published studies of canine behavior evaluations.” Because of this, they remind the reader, “that if sufficient reliability cannot be established, validity is moot.”

Nevertheless, they go on to discuss documented (or not documented) validity in the literature.

First, the authors explain the umbrella concept of construct validity: whether, and how strongly, an evaluation measures what it claims to measure. Convergent, criterion, discriminative, and predictive validity may all fall under construct validity. As its name implies, convergent validity describes the degree to which an evaluation converges or agrees with another existing instrument that is supposed to measure the same thing, with both usually being administered at the same time. Ideally, the existing instrument would itself have been validated, but all of the studies reporting on convergent validity used an unvalidated comparison. Patronek et al. (2019) identified four studies that attempted to establish convergent validity, though none were able to do so convincingly. Criterion validity is similar, except that the evaluation is compared to an objective, measurable outcome or to a gold standard. Because there is no gold standard, no studies attempted to establish criterion validity.

Discriminative validity—the ability to detect differences between categorized groups—was demonstrated in three studies. In these studies, the behavior evaluations revealed statistically significant differences in mean or median aggression scores between dogs who had been categorized as aggressive (prior to testing) and dogs who had not.

Finally, predictive validity refers to the behavior evaluation results being correlated with future (or past) behavior, such as in the home. The most important distinction when discussing predictive validity is not to confuse it with predictive ability, which refers to a test’s ability to predict behavior for individual dogs. The authors report that while six studies attempted to establish predictive validity, the results were too weak to do so. Specifically, among the studies, there were weak-to-moderate correlations, unacceptable sample biases (e.g., only those who passed the first time were re-tested), or subtests that were not practically relevant (e.g., play attempts).

To conclude their in-depth examination of validity of shelter canine behavior evaluations across all studies that report on it, the authors write, “In summary, although a few studies reported statistically significant findings for various aspects of construct validity, none of the studies demonstrated compelling evidence of construct validity in a more global sense.”

Next, the authors address issue 2: the limitations of correlation and regression. They note that “correlation” and “agreement” are often used interchangeably, though the concepts are not the same and it is scientifically imprecise to treat them as such. Correlation measures the direction and strength of a linear relationship between two continuous variables; if agreement is what is being claimed, then the Bland-Altman method (also known as the Tukey mean-difference plot) should be used. For categorical data, kappa is the appropriate measure. Moreover, when reporting agreement, it is imperative to adjust for agreement expected by chance. The authors illustrate this with an example from the literature in which raw agreement between dogs’ biting behavior on an evaluation and their behavior in the home looked quite impressive (81.8%), but once corrected for chance agreement that number dropped to a much less impressive 42%.
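To make the chance correction concrete, the following is a minimal sketch of how raw agreement and Cohen's kappa can be computed from a two-by-two table of yes/no outcomes. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the paper or from any study it reviews, and the paper's own calculation may differ in detail.

```python
def cohen_kappa(both_yes, eval_only, home_only, both_no):
    """Return (raw agreement, Cohen's kappa) for a 2x2 table of yes/no outcomes.

    both_yes  : behavior seen on the evaluation AND in the home
    eval_only : seen on the evaluation only
    home_only : seen in the home only
    both_no   : seen in neither setting
    """
    n = both_yes + eval_only + home_only + both_no

    # Raw (observed) agreement: the share of dogs on which the two sources agree.
    observed = (both_yes + both_no) / n

    # Agreement expected by chance alone, from the marginal "yes" rates.
    eval_yes = (both_yes + eval_only) / n
    home_yes = (both_yes + home_only) / n
    expected = eval_yes * home_yes + (1 - eval_yes) * (1 - home_yes)

    # Kappa: how far observed agreement exceeds chance, scaled to a maximum of 1.
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts for 100 dogs (illustration only, not data from any study).
observed, kappa = cohen_kappa(both_yes=10, eval_only=10, home_only=10, both_no=70)
print(f"raw agreement = {observed:.3f}, kappa = {kappa:.3f}")
```

When the behavior is uncommon, most of the raw agreement comes from the many dogs who show it in neither setting, which is why the chance-corrected figure can sit far below the raw percentage.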

In the third section, Patronek et al. (2019) elaborate on the much higher bar of predictive ability versus predictive validity.

“As discussed previously, predictive validity can be established when scores on a behavior evaluation in a population of dogs significantly correlate with a second variable that can reasonably be thought of as dependent on the characteristics being measured, such as future behavior in the home. However, meeting these criteria does in no way imply how accurate the prediction of future behavior will be. Predictive ability, by contrast, reflects the likely accuracy of that evaluation when predicting behavior of individual dogs in the real world, as reflected by the number of errors (i.e., false-positive and false-negative results).”

Predictive ability, they explain, is determined by the test’s sensitivity, specificity, and the prevalence of the target behavior in the population of interest. In the related article, “No Better Than Flipping a Coin: Reconsidering Canine Behavior Evaluations in Animal Shelters,” Patronek & Bradley (2016) dive deeply into the issue of prevalence, and why canine behavior evaluations in shelters are doomed to result in high false positives, even while false negatives are low.
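The dependence on prevalence can be sketched with a short calculation. The example below assumes that the error rates of interest are the share of positive results that are wrong and the share of negative results that are wrong (that is, one minus the positive and negative predictive values); this reading, the function name, and the input values are illustrative assumptions, not figures from either paper.

```python
def predictive_error_rates(sensitivity, specificity, prevalence):
    """Return (share of positive results that are false,
               share of negative results that are false)."""
    # Expected proportion of the whole population in each cell of the 2x2 table.
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)

    false_share_of_positives = false_pos / (true_pos + false_pos)  # 1 - PPV
    false_share_of_negatives = false_neg / (true_neg + false_neg)  # 1 - NPV
    return false_share_of_positives, false_share_of_negatives


# Same hypothetical test (80% sensitivity, 80% specificity) at two prevalences.
for prevalence in (0.50, 0.10):
    fp_share, fn_share = predictive_error_rates(0.80, 0.80, prevalence)
    print(f"prevalence {prevalence:.0%}: "
          f"{fp_share:.1%} of positives are false, "
          f"{fn_share:.1%} of negatives are false")
```

Holding the test itself fixed, lowering the prevalence pushes the share of false positives up and the share of false negatives down, which is the pattern described above.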

Of the studies that reported predictive ability, all but two did so in owned dogs, which is important given that the prevalence of a behavior determines how sensitivity and specificity translate into predictive ability, and given the role of the testing environment in canine behavior. Still, the results were not impressive:

“…false-positive rates for various individual dog behaviors in the study populations of mostly owned dogs ranged from 11.8 to 53.7% (mean, 35.1%); false-positive rates we estimated for real-world shelter populations using the stated values for sensitivity and specificity ranged from 28.8% to 84% (mean, 63.8%). The mean false-negative rate in study populations was 25.6%, whereas the estimated mean false-negative rate for typical shelter populations was estimated as 8.5%.”

Though the authors do not state so outright, the takeaway from the presented data is clear: acceptable predictive ability has not been established for any evaluation to date.

In the fourth section of this paper, the authors discuss a more commonly acknowledged problem: statistical significance versus clinical or practical significance. To drive their point home, Patronek et al. (2019) display correlation plots of simulated data showing that the same weak correlation (r ≅ .25) can range from “not statistically significant” to “highly statistically significant” simply by increasing the sample size. In context, this can explain how instruments may be mistaken for “validated” just because statistical significance was reached for some aspect of the evaluation.
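The effect of sample size on significance can be sketched numerically. The example below does not reproduce the authors' simulation; it uses the Fisher z-transformation with a normal approximation (an assumption made here so the code needs only the Python standard library) to estimate a two-sided p-value for a fixed correlation at several hypothetical sample sizes.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided p-value for testing 'no correlation' (rho = 0).

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated as a
    standard normal statistic. This is an approximation and requires n > 3.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Same weak correlation, increasing (hypothetical) sample sizes.
for n in (20, 50, 100, 500):
    print(f"n = {n:3d}: r = 0.25, approximate p = {approx_p_value(0.25, n):.4g}")
```

The strength of the relationship never changes in this sketch; only the sample size does, yet the p-value moves from well above the conventional 0.05 threshold to far below it.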

Finally, Patronek et al. (2019) tackle the biggest concept of the paper: what it means to establish overall validity. In short, it is not something that can be achieved with one study, one approach, or by establishing one aspect of reliability or construct validity. Establishing validity is a process that requires repeated studies of adequate sample size, conducted in the target population and under test conditions that reflect real-world conditions.

The authors conclude that given the lack of established reliability or validity for any existing behavior evaluation, such tests should not be used in animal shelters to determine a dog’s future.

*Please note that the authors are affiliated with National Canine Research Council. Gary Patronek is a paid consultant for National Canine Research Council, and Janis Bradley and Elizabeth Arps are employees of National Canine Research Council.

Abstract and Link to Full Text of the Original Article:

Additional References:

Patronek, G. J., & Bradley, J. (2016). No better than flipping a coin: Reconsidering canine behavior evaluations in animal shelters. Journal of Veterinary Behavior: Clinical Applications and Research, 15, 66-77. doi:

​Page last updated September 23, 2019