Update (September 23, 2019): Since we published this literature review in National Canine Research Council’s Research Library, a comprehensive, detailed analysis of the (lack of) reliability and validity of behavior evaluations as they pertain to dogs living in animal shelters has been published in the peer-reviewed literature. We have added a summary and analysis of that paper here: What is the evidence for reliability and validity of behavior evaluations for shelter dogs? A prequel to “No better than flipping a coin.” Authored by three researchers affiliated with National Canine Research Council, Patronek, Bradley, and Arps (2019) systematically examined claims of 17 peer-reviewed articles (many of which are discussed in this review as well). Patronek et al. summarized what has been reported, analyzed the strength of any claims made regarding reliability and/or validity, and provided context for said claims. The authors addressed five major issues concerning reliability and validity: 1) colloquial versus scientific terminology, 2) limitations of correlation and regression, 3) predictive validity versus predictive ability, 4) statistical versus clinical significance, and 5) requirements for establishing overall validity. Patronek et al. determined that neither reliability nor validity have been sufficiently demonstrated for any of the existing behavior evaluations, and thus do not recommend their use for dogs living in animal shelters. These important findings should be considered and kept in mind when reading our review.

In this literature review, we will concern ourselves primarily with the applicability of formal provocative behavior tests (behavior evaluations) used to attempt to predict the behavior of a dog residing in a shelter after he is placed in an adoptive home. We review here the various attempts to determine the validity and usefulness of such tests. However, behavior evaluations (subjecting dogs to various stimuli that are hoped to simulate likely real-life events, and recording their responses) are conducted on various populations of dogs for various reasons. In some cases, these tests are performed on owned dogs, usually as an attempt to answer research questions regarding whether various dog demographics (e.g., size, age, sex, reproductive status, breed) correlate with test results (one is discussed in De Meester et al., 2008) or to guide decisions on which dogs should be included in breeding programs (one is discussed in Netto & Planta, 1997). Some instruments have been used for both purposes (one is discussed in Fuchs, Gaillard, Gebhardt-Henrich, Ruefenacht, & Steiger, 2005). These evaluations are not attempts to predict behavior, but rather to provide an efficient measure of behavior without relying on subjective owner reports. A few attempts have been made to validate such evaluations by comparing them with owner surveys of their dogs’ behavior (see Barnard et al., 2012 and Bennett et al., 2012). Owner surveys in general will be discussed in a separate literature review.

In the United States and some other countries, a third primary purpose is at work (although, at present, there is no formal data on how frequently this goal is in play). Approximately 3.9 million dogs enter animal shelters each year in the US. Some of these shelters (it is unknown how many) administer behavior evaluations to guide decisions regarding which dogs should be made available for adoption and/or under what conditions (e.g., placing dogs only after behavior modification to alter problematic behaviors, matching them with households where evaluation scores suggest they are likely to be successful, or simply informing adopters of evaluation results). In other words, the purpose of the evaluation of a dog residing in a shelter is to predict how he will behave once placed in a human home. Shelter motivations for administering and making decisions based on such tests may include concerns about avoiding liability and avoiding compromising public safety by adopting out dogs whose subsequent injurious bites could have been predicted. Another frequently expressed motivation is the previously mentioned matchmaking goal, along with having a rationale that does not appear entirely arbitrary for culling dogs in shelters that take in more dogs than can be housed. No research has yet been completed on why shelters adopt or maintain this practice, however, so this remains speculative and anecdotal. We have some limited information (Mohan-Gibbons, Weiss, & Slater, 2012) on numbers of shelters using behavior evaluations; however, it is unclear whether these results are representative of the shelter industry as a whole.

Whether administered to owned dogs or dogs living in a shelter, a behavior evaluation assumes that 1) the dog’s temperament is a permanent attribute and this is what the test is measuring, 2) the dog’s behavior in a semi-controlled setting with provocative stimuli will represent or even predict behavior in the home and in public, and 3) specifically in shelter-administered tests, potential adopters are concerned about the behaviors being tested for. In recent years, scientists have begun to question these assumptions and doubt the usefulness of such tests. The literature is rapidly growing as researchers attempt to identify and clarify the science behind behavior evaluations.

Shelter workers often hope to find dogs who match prospective adopters’ lifestyles and to tell adopters what types of behaviors they might expect down the road. Will the dog be playful and engaged or independent and mellow? Will the dog interact well with other dogs and be safe around cats? How will the dog respond to energetic children? Will the dog behave “aggressively” in a home (often broadly defined as growling, snarling, snapping, lunging, or biting)?

The methodology and criteria for determining the efficacy of diagnostic tests are well established in the realm of behavioral and medical assessments of human beings. Such efficacy determinations include determining the reliability (i.e., the replicability of the results) and validity (i.e., does the test actually measure what it is intended to measure). These questions are addressed by rating a test’s sensitivity (what proportion of the individuals who actually exhibit the condition of interest does the test identify) and specificity (what proportion of individuals not exhibiting the condition does it rule out). These measures, along with the rate of occurrence of the condition in the population (called prevalence) can then be combined to determine how likely a positive or negative diagnosis is to be correct in individual cases and in the aggregate. To the very limited extent that such analysis has been done with regard to canine behavior evaluations, the results have not been encouraging (e.g., Van der Borg, Netto, & Planta, 1991; Netto & Planta, 1997; Christensen, Scarlett, Campagna, & Houpt, 2006). In other words, all of the evaluations analyzed so far lead to unacceptable levels of false positives, i.e., dogs misidentified as more than normally likely to growl, snarl, snap at or bite humans in particular, and these erroneous results often result in an unnecessarily longer stay in the shelter or euthanasia of the misidentified dogs.
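The way sensitivity, specificity, and prevalence combine into the probability that an individual result is correct can be sketched in a few lines. This is a minimal illustration only: the function name and the input values are hypothetical and are not taken from any of the studies cited here.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV): the probability that a positive or a
    negative test result is correct, given the test's sensitivity
    and specificity and the prevalence of the condition."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # positive predictive value
    npv = true_neg / (true_neg + false_neg)  # negative predictive value
    return ppv, npv


# Hypothetical inputs chosen only to illustrate the arithmetic.
ppv, npv = predictive_values(sensitivity=0.80, specificity=0.80, prevalence=0.10)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # PPV = 0.31, NPV = 0.97
```

With these illustrative inputs, fewer than one in three positive results would be correct, even though the test identifies 80% of affected and 80% of unaffected individuals.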

Several studies have examined the validity and reliability of specific evaluations. Though many report their evaluations to be valid, the data often do not hold up under scrutiny. The evaluations have detected variable behavior between dogs, but these differences are not always reliably reproduced over time (Bennett, Weng, Walker, Placer, & Litster, 2015), between raters (Diesel, Brodbelt, & Pfeiffer, 2008), or between different tests (Bram, Doherr, Lehmann, Mills, & Steiger, 2008; Haverbeke, Pluijmakers, & Diederich, 2015; Rayment, De Groef, Peters, & Marston, 2015). It is also becoming increasingly clear that the behaviors observed during evaluations, particularly those conducted in shelters, cannot reliably predict future behavior in homes (Marder, Shabelansky, Patronek, Dowling-Guyer, & D’Arpino, 2013; Mohan-Gibbons et al., 2012).

In this review, we will discuss attempts to establish reliability and validity in tests administered to owned dogs and dogs living in shelters separately, as in the first case, currently expressed behavior is being assessed, in contrast to the latter, which is an attempt to predict behavior in an entirely different environment. There is no reason to suppose that validation with one population would carry over to the other. We will also review—in tests on each type of population—attempts to establish various tests’ reliability, either between raters or over time. Such reliability is a necessary, but not sufficient, condition for establishing validity. Finally, we will discuss a recent statistical analysis of the limitations of behavior evaluation validity in shelters.


Examples of attempts to determine reliability and validity of canine behavior evaluations


Population: Owned Dogs


Test-retest reliability and inter-rater reliability (IRR) have been evaluated for a number of behavior assessments. Three commonly used tests (“child-like doll,” “fake dog,” and “ambiguous object”) were assessed by Barnard, Siracusa, Reisner, Valsecchi, & Serpell (2012) in tests of owned dogs. The study compared the subjects’ behavior on the tests to the dogs’ known behavior in order to assess whether responses to these fake objects correlated with the dog’s known history of interactions with children and other dogs. The only mention of an actual application for these tests was in determining which dogs would be adoptable to homes with children. The only reliability measure in this study was a score-rescore measure using the same evaluator. The authors had one coder re-score a sample of videos 8 months after the initial coding, which showed intra-rater repeatability for that coder.

The 2005 study by Svartberg, Tapper, Temrin, Radesater, and Thorman used the Dog Mentality Assessment (DMA) to study personality in canines. Personality refers to behavioral consistency within an individual, so the researchers administered a behavior evaluation several times to the same dogs in different locations. The five personality traits that were investigated were Playfulness, Chase-proneness, Curiosity/Fearlessness, Sociability, and Aggressiveness. The researchers did find consistency over time, demonstrating the test-retest aspect of reliability.

Subjects were 40 pet dogs (20 males). Standardized scoring sheets were used to record dogs’ behavior on 33 variables across 10 subtests, with scores ranging from 1 (low intensity) to 5 (high intensity) for each variable. The ten DMA subtests ranged from the mundane (e.g., approaching a stranger, a short walk), to playful (e.g., tug of war, chasing a rag) to startling (e.g., the sudden appearance of a dummy, loud metallic noise, and approaching “ghosts”). The tests were spaced temporally with approximately one month between the first and second, and second and third trials. Measured behaviors included, but were not limited to, greeting behavior, attention towards a stimulus, interest in play, startle reaction, avoidance behavior, aggressive behavior, exploratory behavior, and activity level.

The results raise legitimate questions about the reliability and validity of behavior evaluations, particularly their potential as screening tools with regard to dangerous behavior. The internal consistency for Aggressiveness was lower (alpha = 0.67) than for any of the other personality traits (ranging from 0.80 to 0.89), and “aggression” is typically the trait of interest for behavior evaluations. Most of the traits (Playfulness, Chase-proneness, and Sociability) were found to be consistent over time, but Curiosity/Fearlessness increased and Aggressiveness decreased over time. Both the Curiosity/Fearlessness continuum and Aggressiveness resulted in significantly different scores between tests 1 and 2. This again calls into question the use of similar tests to determine whether a dog is adoptable. If such responses cannot be reliably measured over time, and are higher on the first examination, the predictive usefulness is questionable.
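Internal consistency of the kind reported above is commonly summarized with Cronbach's alpha. The sketch below shows how such a coefficient is typically computed, assuming that is the statistic meant by "alpha" here. The function name and the scores are hypothetical; they are not the DMA data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per dog, one column per test
    variable assumed to measure the same trait.
    """
    n_items = len(scores[0])
    columns = list(zip(*scores))  # one tuple of scores per variable
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical 1-5 intensity scores for five dogs on three variables.
example_scores = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 4],
]
print(f"alpha = {cronbach_alpha(example_scores):.2f}")
```

Alpha approaches 1 when the variables rise and fall together across dogs, and drops when they do not, which is why a lower value for one trait suggests its component measures are less coherent.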

Klausz, Kis, Persa, Miklosi, and Gacsi (2014) describe the development of a “Quick Assessment Tool” (QAT) for predicting biting and snapping behavior in pet dogs. The QAT included 5 tests that ranged from friendly greetings to the owner attempting to hold the dog on its back for a minute. As a reliability measure, 19 of the original 73 subjects were re-tested approximately one year later, and no differences between the results of the two tests were found, indicating test-retest reliability within this sample of owned dogs. Whether this test-retest reliability would occur among dogs whose living situation had changed (e.g., from living in a shelter to living in a home) is unknown. In Kis, Klausz, Persa, Miklosi, and Gacsi (2014), for example, researchers found that the sensitivity of a behavior evaluation could be strongly affected by the presence or absence during the test of a person with whom the dog had formed a bond.

A 2008 study by Bram et al. compared three behavior tests that are used to assess “aggressive” behavior in Europe and elsewhere, primarily for use with pet dogs legally designated as members of “dangerous breeds.” The purpose of this research was not to validate any or all of the tests, but rather to determine whether the three tests, which all claim to accurately evaluate canine behavior, come to the same conclusions for a given dog, in other words, to assess inter-test reliability. Two of the tests, the test of the American Staffordshire Terrier Club of Switzerland and the Test of the Canton of Basel-Stadt, Switzerland have legal force in that country. The third, the “dog handler test,” does not.

Sixty pet dogs of various breeds and mixed breeds were evaluated on interspecific and intraspecific behavior using the three behavior tests. There was high agreement across the three tests for “open, friendly, neutral behavior.” Most dogs received this score on all three tests. Overall, there was poor agreement across the three tests for intraspecific behavior (46%) and interspecific behavior (55%). Because so few dogs in the sample showed any aggression (which was undefined, and up to the subjective judgement of the scorers), the dogs that scored a 2 or higher on any test were grouped together under the label “potentially aggressive,” meaning the scorer had judged them to have exhibited “dominant behavior or mistrustful, fearful behavior, without aggression” (emphasis National Canine Research Council). These dogs were further analyzed with pairwise comparisons. In other words, the authors maintained the assumption, unproven anywhere in the literature, that it is possible to identify “potentially aggressive behavior” in dogs that neither threaten nor bite. Inter- and intraspecies behavior analyses remained separate. Despite small sample sizes (n = 23 and 29 for inter-specific and intra-specific behavior, respectively), the data showed significant differences in the results between tests A and B (P = 0.035) and between tests B and C (P < .001), and the results were inconclusive between tests A and C. Overall, these three commonly used tests did not exhibit inter-test reliability in predicting interspecific aggression (towards humans). The authors concluded that such low levels of reliability suggest that the tests currently in use are of insufficient validity to be used in legal decision making about the disposition or conditions of keeping individual pet dogs.

Not every study in the literature reports data on reliability. In their 2012 manuscript, Bennett, Litster, Weng, Walker, and Luescher discuss the importance of reliability (both test-retest and interrater). However, they did not include measurements for either, so reliability was not included other than to comment on its necessity. Because these tests are used to influence adoption and euthanasia decisions, reliability, while of course not sufficient to establish validity, should be included in every study that reports on them.


The validity of behavior evaluations, i.e., the issue of whether the test is measuring what it was designed to measure (a stable, and therefore predictable, response to social stimuli), has been studied using pet dog samples. Validity of the “approaching stranger,” “child-like doll,” “fake dog,” and “ambiguous object” tests was assessed in Barnard et al.’s (2012) study by comparing 34 dogs’ behaviors on the tests to their scores on behavior toward strangers, children, other dogs, and novel objects on an owner reporting instrument, the Canine Behavioral Assessment and Research Questionnaire (C-BARQ). Once again there is a disconnect between the subjects used and the target population; these tests are meant to be predictive of shelter dogs’ future behavior, but they are validated on pet dogs whose histories are known and whose owners are present for the tests. The validation was retrospective rather than predictive. Although they are cautious with their conclusions, the authors do purport that their hypotheses were partially supported in that the three groups of dogs (those with a history of child aggression, dog aggression, and no aggression) behaved differently on the child-like doll and fake dog tests, although there was overlap. However, all the correlations included multiple false negatives and positives, and even inverse relationships. For example, dogs that did not have a history of dog aggression had higher aggression scores than dog-aggressive subjects on the fake dog test. Specificity and sensitivity of the tests were not calculated, perhaps because of the very small sample sizes in each category. Unsurprisingly, owners who reported biting or attempted biting behaviors on the three study categorizing questions also reported these behaviors on the C-BARQ, but it is difficult to see how this constitutes a validation metric.

As is usually the case among evaluations of owned dogs, the owners were present (holding the dog on leash) during the subtests, a factor that has been shown to influence behavior in other kinds of studies. And of course, dogs in a shelter do not have a familiar, trusted individual with them during such evaluations, so one should expect different behavior patterns between the two populations.

Similarly, Bennett et al. (2012) assessed the behaviors of 73 pet dogs (6 of whom were excluded from the study for various reasons) on two evaluations, the Meet Your Match (MYM) Safety Assessment for Evaluating and Rehoming (SAFER) and a modified Assess-A-Pet (mAAP), and compared their scores to the dogs’ behavior histories as measured by C-BARQ. A series of statistical tests were conducted to determine sensitivity, specificity, false positives, false negatives, and odds ratios for both instruments. Although mAAP had somewhat better sensitivity (0.73) and specificity (0.59) than SAFER (0.6 and 0.5, respectively), such levels yield too many false positives and false negatives to be of predictive value in making decisions about the disposition of shelter dogs. Within a shelter population, the specificity and sensitivity would likely be even lower if the behavior evaluations were validated against an owner survey like the C-BARQ after the dogs’ environment had undergone the dramatic change from shelter to home.

Although 28 dogs in the sample had been adopted from shelters, all had been in their current homes for at least 3 months, and no information was available on whether any had been previously tested using either instrument, or, if so, what those results might have been.

Klausz et al.’s (2014) paper describes the development and validation of a “Quick Assessment Tool” (QAT) meant to predict biting and snapping behavior in pet dogs. Subjects were 73 adult pet dogs categorized as non-biting controls (NB), dogs who had bitten once (OB), and dogs who had bitten multiple times (MB). The operational definition of a biter, however, was one who had “bitten or snapped at a person at least once in their lives.” Biting and snapping are distinct behaviors, and only biting is even potentially injurious. Given this definition, nearly two thirds of the 73 dogs studied were labeled as either OB or MB. No distinction was made regarding bite severity or motivation (e.g., play bites). If the purpose of such behavior identification is ultimately human safety, it is difficult to see how this can be achieved absent such distinctions. Biting behavior and history of “aggression” were provided by the owners, and the nature of the questions required a long recall period. The typical biases of human memory and long recall periods apply, as well as the general shortcomings of self-reported data.

The subjects were administered five tests in a fixed order and all trials were videotaped for further analysis. The five tests were friendly greeting, take away bone, threatening approach, tug of war, and roll over. After the short test battery, owners were asked to indicate how often their dogs behave “aggressively” towards strangers and towards family members (each on a 1-5 scale). “Aggressive” was defined as growling, snarling, snapping, or biting. Results from the assessment were compared to the dogs’ reported biting history and owner-reported “aggression”; however, having the owners present during the evaluation could have biased their responses on the later questions. For example, if an owner observed their dog behaving aggressively during the test, they may be more likely to remember or report previous instances of biting behavior, whereas owners who saw their dog behaving non-aggressively during the test may be more likely to dismiss previous instances of aggression.

Behavior during the QAT did not correlate with owner-reported aggression towards the family. There were, however, some correlations between three of the subtests and owner-reported stranger aggression: threatening approach (r = 0.33; P = 0.004), tug-of-war (r = 0.34; P = 0.006), and friendly greeting (r = 0.33; P = 0.007). The authors found significant differences in their QAT between dogs who had reportedly bitten or snapped in the past and those who had not, but the sensitivity and specificity of the 3 correlating tests collectively (0.76 and 0.73), while reasonably good for a diagnostic test, would yield many false positives given the low incidence of the behaviors of interest (see Patronek & Bradley, 2016 below for an explanation of this relationship).
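To make the base-rate point concrete, the following sketch applies the reported sensitivity (0.76) and specificity (0.73) to a notional group of 1,000 dogs. The prevalence values and the function name are assumptions made purely for illustration; the true prevalence of the behaviors of interest is not given here.

```python
def expected_counts(sensitivity, specificity, prevalence, n_dogs=1000):
    """Expected true positives and false positives when a test is
    applied to n_dogs animals at the given prevalence."""
    affected = n_dogs * prevalence
    unaffected = n_dogs - affected
    true_pos = sensitivity * affected
    false_pos = (1 - specificity) * unaffected
    return true_pos, false_pos


# Sensitivity and specificity as reported for the three QAT subtests;
# the prevalence values below are hypothetical.
for prevalence in (0.02, 0.05, 0.10, 0.20):
    tp, fp = expected_counts(0.76, 0.73, prevalence)
    share_wrong = fp / (tp + fp)
    print(f"prevalence {prevalence:.0%}: {tp:.0f} true positives, "
          f"{fp:.0f} false positives, {share_wrong:.0%} of positives wrong")
```

At an assumed prevalence of 10%, for example, about 76 dogs would be correctly flagged and about 243 incorrectly flagged, so roughly three of every four positive results would be false.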

Kis, Klausz, Persa, Miklosi, & Gacsi (2014) investigated a potential confound in a behavior evaluation used to identify “aggression,” “fear,” and “anxiety” in pet dogs (see below for a discussion on the shelter dogs portion). Three scenarios adopted from Klausz et al.’s (2014) QAT discussed above were used including a friendly approach, a threatening approach, and taking away a bone. The authors hypothesized that results would vary among pet dogs depending on whether or not their owners were present during the evaluation.

Fifty pet dogs were evaluated across the three conditions two separate times, once with and once without their owners present. Whether owners were present for the first or second round was randomized for each subject. Dogs were chained to a tree and spike, with the ability to move forward and backwards. Growling, snarling, snapping with/without attack, and biting were all deemed “aggression,” and “anxiety-discomfort” was defined by muzzle licking, scratching, and yawning.

The results indicated that the owner’s presence did affect behavior. For the friendly greeting scenario, dogs showed more “anxiety-discomfort” when owners were absent than present (Z=2.140, P=0.032). “Aggression” increased for eight dogs in the take away bone scenario when owners were present (Z=2.354, P=0.019); four dogs increased from no “aggression” to growling, one increased from no “aggression” to snarling, one increased from no “aggression” to biting or snapping, one increased from no “aggression” to biting, and one increased from growling to biting. All other dogs showed the same level of “aggression” (or lack thereof) with or without owners present; 40 showed no “aggression” on either trial and one bit the artificial hand used to take the bone away on both trials. Finally, for the threatening approach scenario, threatening behaviors increased when the owner was present (Z=2.673, P=0.008). In all, 13 dogs showed an increase in “aggressive” behavior when their owner was present; 12 showed no “aggression” without the owner but growled with the owner and 1 dog growled without the owner, but snarled when the owner was present. The remaining dogs were consistent between conditions; 32 dogs showed no “aggression” in either condition, and 5 growled during both presentations.

The data supported the hypothesis that the presence or absence of a familiar person can significantly impact results, at least in the case of owned dogs. The malleability of the behavior by this simple change indicates a lack of validity for such an assessment and underscores the contextual nature of the expression of threatening and biting behaviors in dogs.

Population: Shelter Dogs


Reliability of evaluations has been studied less often in shelter dogs. The 2015 manuscript by Bennett et al. is one of the few studies that examined test-retest reliability in shelter dogs. Bennett et al. (2015) hypothesized that because stress in shelters changes over time, behavior evaluation (specifically, SAFER) results would also change over time. When analyzing the data, the researchers were particularly interested in cases where scores changed by at least 2 points (on a scale of 1-5) from day 0 to day 3; these differences were used to calculate percent discordance. They felt this was of practical importance because differences of this magnitude could conceivably result in different recommendations and vastly different outcomes for dogs, ranging from adoption availability to euthanasia. The authors found that results from the SAFER evaluation varied somewhat, but not significantly, from day 0 (intake) to day 3. There was, at best, moderate agreement between days 0 and 3 for most of the subtests, and for the 2 where agreement was good (toy and rawhide removal) the agreement may have been an artifact of extremely low scores recorded for the entire study group. Nearly half of the 49 original subjects did not have data on both days due to lack of availability (e.g., having been returned to owners or having been euthanized for having displayed unhandleable threatening behavior or for simply showing lack of interest on the food test). Because of the inconsistencies, Bennett et al. (2015) suggest that shelters avoid testing dogs when they are particularly stressed, but here they employed the untested assumption that the second iteration of the test yielded the more valid result.
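The percent discordance described above (the share of dogs whose score moved by at least 2 points between the two test days) can be sketched as follows. The function name, the handling of missing scores, and the data are hypothetical and are not drawn from Bennett et al. (2015).

```python
def percent_discordance(day0_scores, day3_scores, threshold=2):
    """Percentage of dogs whose score changed by at least `threshold`
    points between two administrations of the same subtest.

    Dogs missing a score on either day (None) are left out, on the
    assumption that only complete pairs can be compared.
    """
    pairs = [
        (first, second)
        for first, second in zip(day0_scores, day3_scores)
        if first is not None and second is not None
    ]
    if not pairs:
        raise ValueError("no dogs have scores on both days")
    discordant = sum(1 for first, second in pairs if abs(first - second) >= threshold)
    return 100.0 * discordant / len(pairs)


# Hypothetical 1-5 scores for six dogs; None marks a dog not retested.
day0 = [1, 2, 5, 3, 1, 4]
day3 = [1, 4, 2, 3, None, 3]
print(f"{percent_discordance(day0, day3):.0f}% discordant")  # 40% discordant
```

In this made-up example, two of the five dogs with scores on both days changed by 2 or more points.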

The most important finding from this study is that even over as short a time period as 3 days, dogs’ behaviors can change. This fundamental lack of test-retest reliability precludes the possibility of external validity for this behavior assessment.

In 2008, Diesel et al. assessed behavior evaluation inter-rater reliability and intra-rater repeatability among 40 welfare center staff members based on videoed tests of 20 dogs. Overall results were mixed; raters were consistent with each other when evaluating the person approaching kennel trials but were inconsistent for general handling and grooming trials. For the kennel approach condition, agreement was moderate between raters for all behavioral responses except “indifferent,” which had poor inter-rater reliability. For general handling and grooming trials, there was moderate agreement for “nervous,” “well-behaved,” and “excitable” responses (kappa = 0.46, 0.26, and 0.43, respectively), but poor inter-rater reliability for “fear-aggressive” (kappa = 0.04) and “pushy-aggressive” (kappa = 0.03). Finally, when the subject was introduced to another dog, raters again had moderate agreement for “nervous,” “well-behaved,” and “excitable” responses (kappa = 0.28, 0.24, and 0.37, respectively), but poor reliability for “indifferent,” “fear-aggressive,” and “pushy-aggressive” (kappa = 0.16, 0.17, and 0.11, respectively).
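Kappa statistics such as those above express agreement beyond what chance alone would produce. The sketch below computes Cohen's kappa for the simplest case of two raters. This is a simplification for illustration: the study involved 40 raters, and the exact kappa variant it used is not stated here. The function name and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the raters actually agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


# Hypothetical yes/no judgements of "fear-aggressive" for four videos.
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no"]
print(f"kappa = {cohen_kappa(rater_1, rater_2):.2f}")  # kappa = 0.50
```

Here the raters agree on three of four videos (75%), but because half of that agreement would be expected by chance, kappa is only 0.50. Values near zero, like those reported for the “aggressive” categories, mean agreement was barely better than chance.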

Intra-rater repeatability was measured for 18 participants by having them re-score, two months later, the same videos they had scored earlier. For the approach to kennel and general handling and grooming scenarios, reliability was moderate or high for all behaviors. However, when scoring for a second time the same video of a dog meeting another dog, agreement was poor for “nervous,” “well-behaved,” and “pushy-aggressive” (kappa = 0.19, 0.11, and 0.24, respectively) and moderate for the other responses. In other words, raters not only achieved only moderate agreement with each other on most descriptions of behavior, they quite often disagreed with themselves.

The results underscore the unreliability of behavior evaluations, particularly for behaviors that have severe consequences for dogs. Agreement was lowest for “fear-aggressive,” “pushy-aggressive,” and “indifferent” characterizations of behavior, any of which can condemn a dog as unadoptable. Moreover, there was poor agreement between raters on multiple behavioral characteristics, and also poor agreement when a rater re-scored, two months later, the same video of a dog meeting another dog for “nervous,” “well-behaved,” and “pushy-aggressive” behaviors.

Kis et al. (2014) investigated a potential confound affecting behavior evaluations previously validated with pet dogs when administered to shelter dogs, examining factors that might mediate “aggression,” “fear,” and “anxiety” on behavior evaluations of shelter and pet dogs (see above for a discussion on the pet dogs portion). Three scenarios adopted from Klausz et al.’s (2014) QAT were used, including a friendly approach, a threatening approach, and taking away a bone. The authors hypothesized that shelter dogs would be more reactive, and therefore display more “aggressive” behavior, if evaluations were conducted after an acclimation period rather than upon arrival. Twenty-five shelter dogs (17 male) were subjects, and they were tested one or two days after arrival at the shelter, and then again two weeks later. Growling, snarling, snapping with/without attack, and biting were all deemed “aggression.” “Fear-submission” consisted of tail between the legs, dipped head, tensed posture, and lying on its back, and “anxiety-discomfort” was defined by muzzle licking, scratching, and yawning. The only condition in which significant differences between the two tests were found was the “take away bone” subtest. Data showed that more dogs were scored as “aggressive” on the second attempt than the first.

The data partially supported the hypothesis and showed that results can differ depending on when a behavior evaluation is administered, which undermines test-retest reliability. This finding highlights a major weakness in behavior evaluations, as test-retest reliability is necessary for a test to be valid. The results should also be considered with respect to the study by Diesel et al. (2008), which showed that the same raters viewing the same video-recorded dogs are unreliable over time for “well-behaved,” “pushy-aggressive,” and “nervous” behavior when the dog was meeting another dog. Thus, it is plausible that the scoring of the behavior, rather than the behavior itself, changed over time. Either way, both studies’ results highlight the unreliability of such a test.


Finally, the question of whether a behavior evaluation actually tests the behavior of interest, i.e., whether it is valid, has been studied in some specific cases with shelter dogs. For example, Shabelansky, Dowling-Guyer, Quist, D’Arpino, and McCobb (2015) compared shelter dogs’ behaviors towards both a live and a fake dog, to determine whether behavior toward a fake dog tested the dog’s behavior toward actual conspecifics. This subtest was chosen because behavior towards a conspecific is commonly measured in evaluations (e.g., SAFER, Assess-A-Pet). The purpose was to determine whether a fake dog could reliably be used in lieu of a real dog during shelter evaluations. The authors underscore that they were not attempting to predict future behavior, but rather wanted to evaluate consistency between the two scenarios. If the two stimuli yield similar responses this would justify the use of a fake dog instead of a live dog during behavior evaluations, which could potentially save resources and increase safety and/or comfort for shelter dogs and staff.

Forty-five shelter dogs were subjects. A within-subjects design was used such that each dog experienced both the live dog and fake dog trials. Order of presentation was counterbalanced across subjects and sex. Overall agreement between the conditions was high, but the authors acknowledge that this was due to a consistent lack of behavior, rather than a consistent presence. Moreover, behaviors that were consistent due to positive occurrences were friendly behaviors (tail wagging and sniffing, for example), not “aggressive behaviors.” The data showed that with respect to measuring “aggression” on a behavior evaluation, a fake dog is not an adequate substitute for a real dog, as the subjects responded with more “aggression” towards the fake dog than the real dog. If fake dogs are used for behavior evaluations, they are likely to produce false positives. For these reasons, fake dogs should not be used to predict how a dog might behave towards a real dog in the same evaluation, and evaluations that use fake dogs should be discounted as this method is neither reliable nor validated.
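The point that high overall agreement can be driven by a shared absence of behavior can be illustrated with a small calculation. The sketch below is not the analysis used by Shabelansky et al. (2015), and the counts are invented purely for illustration; they are not data from the study. It compares raw percent agreement with Cohen's kappa and with separate positive and negative agreement for a behavior that is rarely observed under either condition.

```python
def agreement_summary(both_present, fake_only, real_only, both_absent):
    """Summarize agreement between two yes/no codings of the same dogs.

    The four arguments are the cells of a 2x2 table: behavior seen in both
    conditions, only with the fake dog, only with the real dog, or in neither.
    Assumes the behavior is not absent (or present) in every single trial,
    otherwise some of the ratios below are undefined.
    """
    n = both_present + fake_only + real_only + both_absent
    observed = (both_present + both_absent) / n

    # Agreement expected by chance, from the marginal rates of each condition.
    fake_yes = (both_present + fake_only) / n
    real_yes = (both_present + real_only) / n
    expected = fake_yes * real_yes + (1 - fake_yes) * (1 - real_yes)
    kappa = (observed - expected) / (1 - expected)

    # Agreement computed separately for "present" and "absent" codings.
    disagreements = fake_only + real_only
    positive = 2 * both_present / (2 * both_present + disagreements)
    negative = 2 * both_absent / (2 * both_absent + disagreements)

    return {
        "observed": observed,
        "kappa": kappa,
        "positive_agreement": positive,
        "negative_agreement": negative,
    }


# Hypothetical counts for 45 dogs (invented for illustration only).
summary = agreement_summary(both_present=1, fake_only=3, real_only=1, both_absent=40)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts, raw agreement is about 0.91 and negative agreement about 0.95, while kappa is about 0.29 and positive agreement about 0.33. A headline agreement figure can therefore look strong even when the two conditions rarely agree on the presence of the behavior.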

Bollen and Horowitz’s (2008) study is an interesting attempt at validating a behavior evaluation. They used a unique approach of testing dogs who were owner-relinquished and comparing owner-reported canine behavior histories with subsequent evaluation results. The authors hoped that in conjunction with behavior histories and demographic data, the evaluation (Assess-A-Pet) would prove useful in predicting aggression and future in-home behaviors. While the approach was novel compared to other validity studies, the study falls short in that adequate controls were missing and the conclusions reach beyond the reported data.

From an experimental design perspective, the study lacked basic controls. The evaluator knew which dogs had been relinquished for “aggression,” and as a group they were indeed more likely to fail one or more components of the evaluation. The authors argue that standard scoring sheets were used to prevent bias, but unconscious biases can have profound effects on the data. Given that inter- and even intra-rater reliability are typically poor in these kinds of tests, this lack of blinding of the assessor constitutes a fatal flaw. A researcher might be more tense when working with a dog that was deemed aggressive by a former owner, which could in turn affect the dog’s behavior or the assessor’s perception of it. Furthermore, only one evaluator recorded each dog’s behavior, and only once; no data were collected regarding inter-rater or test-retest reliability. Because the evaluator was not blind to the dogs’ histories and because there was no check for rater reliability, the internal validity of the study is severely weakened.

Bollen and Horowitz (2008) conclude that the dogs who were euthanized due to failing the evaluation would have likely been aggressive in the home. However, we cannot say what the euthanized dogs would have or would not have done in the home. That is the crux of the problem not just in this study, but with behavior evaluations in general; for dogs who fail the evaluation, we do not know whether it was a true or false positive because the dog is killed before in-home behavior can ever be assessed.

There have been two studies that examined dogs’ behaviors first while in the shelter, then again in the home after adoption. Both of these studies examined a food aggression subtest. Food aggression tests involve provoking a dog with a fake hand while she is eating. There are variations, but subtests typically involve using the rubber hand to 1) remove a food bowl from the dog while it eats, 2) push the dog’s head away from its food bowl while it eats, 3) stroke the dog while it eats, and 4) place the hand into the bowl while the dog eats. The methodology raises questions of construct validity: do dogs respond the same to a fake hand as they would to their owner at home? Results from Mohan-Gibbons et al. (2012) and Marder et al. (2013) suggest that they do not; despite failing food guarding subtests, adopted dogs often did not exhibit the same behaviors in the home. Moreover, in those cases where mild food aggression was observed in the home, owners reported that it was easily handled and did not affect their attachment to their dog.

It is worth noting that even when some dogs who failed a behavior evaluation are adopted out and studied (e.g., Mohan-Gibbons et al., 2012), many more are ultimately euthanized. Moreover, even in those studies there are exclusion criteria. For example, Mohan-Gibbons et al. (2012) only included dogs that failed the food bowl assessment but passed all of the other subtests. Because the dogs who are euthanized can never be tested in homes, it is impossible to know the true percentage of false positives.
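Why the share of false positives cannot be pinned down can be shown with simple worst-case bounds. The sketch below uses invented counts, not figures from any study cited here, and the function name is hypothetical. It assumes that every dog who failed an evaluation was either adopted and followed up or never observed in a home, and asks how far the true share of false positives among all failed dogs could range.

```python
def false_positive_bounds(failed_total, followed, followed_false_pos):
    """Bound the share of false positives among all dogs that failed.

    failed_total:       dogs that failed the evaluation
    followed:           failed dogs that were adopted and observed at home
    followed_false_pos: followed dogs that did NOT show the behavior at home
    Dogs never observed in a home could all be true or all be false positives.
    """
    unobserved = failed_total - followed
    lower = followed_false_pos / failed_total                 # no unobserved dog was a false positive
    upper = (followed_false_pos + unobserved) / failed_total  # every unobserved dog was a false positive
    return lower, upper


# Hypothetical example: 100 dogs failed, 20 were followed, 11 of those
# showed no problem behavior in the home (invented numbers).
low, high = false_positive_bounds(failed_total=100, followed=20, followed_false_pos=11)
print(f"false positives among failed dogs: between {low:.0%} and {high:.0%}")
```

Under these made-up numbers the true share could lie anywhere from 11% to 91%. The more failed dogs that are never observed in a home, the wider that range becomes.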

Discontinuing Food Guarding Assessments

One study (Mohan-Gibbons et al., 2018) has examined the effects of terminating components of behavior evaluations. Nine shelters participated in a 5-month experimental assessment including two months of baseline, two months of treatment, and a final month of data collection only. During baseline, shelters continued their assessments as normal, whereas during treatment, shelters were asked to cease all formal food guarding assessments. Staff, volunteer, and owner observations were still documented. Dependent variables (shelter intake and outcome statistics; bites and injuries in the shelter; and bites and injuries in the home post adoption) were reported to the experimenters each month.

Findings support an argument commonly made by behavior evaluation opponents: formal assessments do not generate new information beyond what can be identified by casual observation and owner histories. After discontinuing food guarding assessments, the percentage of dogs exhibiting severe food guarding did not change. When the shelters stopped assessing for food guarding, there was a 3% increase in overall returns, but there was no difference in the rate of returns for food-guarding dogs.

Because assessments did not affect the percentage of dogs identified with food guarding behavior, and because removing the assessments did not increase the number of injuries or bites, the authors recommend discontinuing formal food guarding assessments.

Statistical Evaluation

In 2016, authors Gary Patronek and Janis Bradley (both of whom are affiliated with National Canine Research Council) examined canine behavior evaluations from a statistical and theoretical point of view. The authors used existing data on dog bites and owner surrenders attributed to problematic aggression to estimate the prevalence of such behavior among dogs living in shelters. They then estimated optimistic sensitivity and specificity values for evaluations, based on the best results achieved in analogous human diagnostic testing, and applied them in the context of these prevalence estimates. The analysis demonstrated that even if tests could be developed with better sensitivity and specificity than those documented in the few validity studies of these tests, behavior evaluations would still be statistically “no better than flipping a coin” for determining whether a dog will exhibit threat or biting behavior problematic to the owner after adoption.

This converging operations approach, with researchers of varying backgrounds and expertise examining behavior evaluations from different viewpoints, is quickly painting a clear picture: considering their severe consequences, behavior evaluations are too often unreliable and invalid, and even if they weren’t, the low prevalence of the behaviors of interest means inevitably impractical results. Patronek and Bradley (2016) recommend that in lieu of evaluations, shelters redirect those limited resources to activities such as walks, baths, play, and training, which will familiarize staff with individual dogs while improving the animals’ welfare and potentially their adoptability.
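The role of prevalence in this argument can be made concrete with Bayes' rule. The sketch below is not a reproduction of Patronek and Bradley's (2016) calculations; the sensitivity, specificity, and prevalence values are assumptions chosen only to illustrate the mechanism. It computes the positive predictive value (the share of dogs failing a test who would truly show the behavior) and the negative predictive value for a range of prevalences.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) for a binary test via Bayes' rule.

    prevalence:  share of dogs that truly would show the behavior
    sensitivity: P(test positive | behavior present)
    specificity: P(test negative | behavior absent)
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence

    ppv = true_pos / (true_pos + false_pos)  # P(behavior present | test positive)
    npv = true_neg / (true_neg + false_neg)  # P(behavior absent | test negative)
    return ppv, npv


# Assumed, illustrative test characteristics (not taken from the paper).
SENSITIVITY = 0.85
SPECIFICITY = 0.85

for prevalence in (0.02, 0.05, 0.15, 0.30):
    ppv, npv = predictive_values(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Under these assumed values, a failed test at 5% prevalence corresponds to the behavior actually being present in only about 23% of cases, and the positive predictive value reaches 50% only once prevalence rises to 15%. The same test characteristics give very different practical results depending on how common the behavior is in the population tested.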

The following papers are referenced in the above literature review. In addition to the descriptions above, each has a link to a National Canine Research Council Summary & Analysis:

Patronek, G. J., Bradley, J., & Arps, E. (2019). What is the evidence for reliability and validity of behavior evaluations for shelter dogs? A prequel to “No better than flipping a coin”. Journal of Veterinary Behavior, 31, 43-58. doi:

Marder, A. R., Shabelansky, A., Patronek, G. J., Dowling-Guyer, S., & D’Arpino, S. S. (2013). Food-related aggression in shelter dogs: a comparison of behavior identified by a behavior evaluation in the shelter and owner reports after adoption. Applied Animal Behaviour Science, 148(1-2), 150–156. doi:

Barnard, S., Siracusa, C., Reisner, I., Valsecchi, P., & Serpell, J. A. (2012). Validity of model devices used to assess canine temperament in behavioral tests. Applied Animal Behaviour Science, 138, 79-87. doi:

Bennett, S. L., Litster, A., Weng, H., Walker, S. L., & Luescher, A. U. (2012). Investigating behavior assessment instruments to predict aggression in dogs. Applied Animal Behaviour Science, 141(3-4), 139-148. doi:

Bennett, S. L., Weng, H., Walker, S. L., Placer, M., & Litster, A. (2015). Comparison of SAFER behavior assessment results in shelter dogs at intake and after a 3-day acclimation period. Journal of Applied Animal Welfare Science, 18(2), 153-168. doi: 10.1080/10888705.2014.999916

Bram, M., Doherr, M. G., Lehmann, D., Mills, D., & Steiger, A. (2008). Evaluating aggressive behavior in dogs: a comparison of 3 tests. Journal of Veterinary Behavior, 3(4), 152-160. doi:

Shabelansky, A., Dowling-Guyer, S., Quist, H., D’Arpino, S. S., & McCobb, E. (2015). Consistency of shelter dogs’ behavior toward a fake versus real stimulus dog during a behavior evaluation. Applied Animal Behaviour Science, 163, 158-166. doi:

Svartberg, K., Tapper, I., Temrin, H., Radesater, T., & Thorman, S. (2005). Consistency of personality traits in dogs. Animal Behaviour, 69, 283-291. doi:

Mohan-Gibbons, H., Weiss, E., & Slater, M. (2012). Preliminary investigation of food guarding behavior in shelter dogs in the United States. Animals, 2, 331-346. doi:

Diesel, G., Brodbelt, D., & Pfeiffer, D. U. (2008). Reliability of assessment of dogs’ behavioural responses by staff working at a welfare charity in the UK. Applied Animal Behaviour Science, 115, 171-181. doi:

Bollen, K. S. & Horowitz, J. (2008). Behavioral evaluation and demographic information in the assessment of aggressiveness in shelter dogs. Applied Animal Behaviour Science, 112, 120-135. doi:

Kis, A., Klausz, B., Persa, E., Miklosi, A., & Gacsi, M. (2014). Timing and presence of an attachment person affect sensitivity of aggression tests in shelter dogs. Veterinary Record, 174(8). doi:

Klausz, B., Kis, A., Persa, E., Miklosi, A., & Gacsi, M. (2014). A quick assessment tool for human-directed aggression in pet dogs. Aggressive Behavior, 40(2), 178-188. doi:

Mohan-Gibbons, H., Dolan, E. D., Reid, P., Slater, M. R., Mulligan, H., & Weiss, E. (2018). The Impact of Excluding Food Guarding from a Standardized Behavioral Canine Assessment in Animal Shelters. Animals, 8(2), 27. doi:

Patronek, G. J. & Bradley, J. (2016). No better than flipping a coin: Reconsidering canine behavior evaluations in animal shelters. Journal of Veterinary Behavior, 15, 66-77. doi:

Additional References:

De Meester, R. H., De Bacquer, D., Peremans, K., Vermeire, S., Planta, D. J., Coopman, F., & Audenaert, K. (2008). A preliminary study on the use of the Socially Acceptable Behavior test as a test for shyness/confidence in the temperament of dogs. Journal of Veterinary Behavior, 3(4), 161-170. doi:

Fuchs, T., Gaillard, C., Gebhardt-Henrich, S., Ruefenacht, S., & Steiger, A. (2005). External factors and reproducibility of the behaviour test in German shepherd dogs in Switzerland. Applied Animal Behaviour Science, 94(3-4), 287-301. doi:

Haverbeke, A., Pluijmakers, J., & Diederich, C. (2015). Behavioral evaluations of shelter dogs: Literature review, perspectives, and follow-up within the European member states’s legislation with emphasis on the Belgian situation. Journal of Veterinary Behavior, 10(1), 5-11. doi:

Netto, W. J. & Planta, D. J. U. (1997). Behavioural testing for aggression in the domestic dog. Applied Animal Behaviour Science, 52(3-4), 243-263. doi:

Christensen, E., Scarlett, J., Campagna, M., & Houpt, K. A. (2007). Aggressive behavior in adopted dogs that passed a temperament test. Applied Animal Behaviour Science, 106(1-3), 85-95. doi:

Rayment, D. J., De Groef, B., Peters, R. A., & Marston, L. C. (2015). Applied personality assessment in domestic dogs: Limitations and caveats. Applied Animal Behaviour Science, 163, 1-18. doi:

Van der Borg, J. A. M., Netto, W. J., & Planta, D. J. U. (1991). Behavioural testing of dogs in animal shelters to predict problem behavior. Applied Animal Behaviour Science, 32(2-3), 237-251. doi:

Page last updated September 23, 2019