Evaluating aggressive behavior in dogs: a comparison of 3 tests

To best understand this article in the context of the behavior evaluation literature, please see National Canine Research Council’s complete analysis here.

Article Citation:

Bram, M., Doherr, M. G., Lehmann, D., Mills, D., & Steiger, A. (2008). Evaluating aggressive behavior in dogs: a comparison of 3 tests. Journal of Veterinary Behavior, 3(4), 152-160. doi:

National Canine Research Council Summary and Analysis:

Similar to the Bennett et al. (2012) study, which compared two commonly used behavior evaluations in the United States, this 2008 study by Bram, Doherr, Lehmann, Mills, and Steiger compared three behavior tests that are common in Switzerland. However, this was not an attempt at external validation of the tests, but rather an examination of inter-test reliability, that is, whether three tests intended to identify the same behaviors agree with each other. The tests were all designed to assess “aggressive behavior.”

The three instruments were A) the Test of the American Staffordshire Terrier Club (ASTC) of Switzerland, B) the “Halterprufung,” Switzerland, and C) the Test of the Canton of Basel-Stadt, Switzerland. It should be noted that Tests A and C have legal force in the country for certifying owned dogs as either safe or potentially dangerous. Their validity has not been demonstrated either retrospectively or predictively. These specific tests were chosen because of several important similarities: they were all in common use in Switzerland at the time of the study (2007), they are each intended to identify “potentially dangerous dogs,” all follow the assumption that “aggressive behavior” is undesirable, and they were all developed by individuals experienced in canine behavior. Despite their common goals, the tests vary widely in structure and duration.

Sixty dogs were evaluated on interspecific and intraspecific behavior using the three behavior tests. Because the tests' native scoring systems differ (e.g., numerical scores versus pass/fail) and would not allow for easy comparisons, and because the researchers were solely interested in comparing the tests to each other to assess inter-test reliability, the involved experts developed a uniform method for recording data. After each test, the dogs were scored on a categorical scale (1 = open, friendly, neutral behavior; 2 = dominant behavior or mistrustful, fearful behavior without aggression; 3 = threatening, warning; 4 = overt aggression with warning; 5 = overt aggression without warning) for both intra- and interspecies behaviors. There were two experts for each of the three tests (6 experts total), but only one assessor was present for a given test. That is, for Test A, either assessor 1 or assessor 2 was present; for Test B, either assessor 3 or 4 was present; and for Test C, either assessor 5 or 6 was present.

Interestingly, Test A, the traditional ASTC test, examines the dog-owner duo and records “undesirable behavior” for both members, including “aggression” and disobedience for dogs, and uncontrolled or insecure behavior towards dogs and violations of animal protection laws for owners. Because of the nature of this study, owner behavior was not pertinent and thus not included. The test includes nine subtests, some taking place off-leash in a closed area and the others on-leash on a quiet road. Subtests consisted of typical dog-owner experiences including commands to come, sit or lie down, and stay, interactions with individuals and groups, interacting with a “stimulus dog,” and playing. The owner was allowed to interact with their dog in their usual way, using treats and praise.

Test B, the “Halterprufung,” Switzerland, also evaluates the owner-dog dyad. The dogs’ behavior is assessed in normal and novel/stressful situations including being called out of a game, walking on- and off-leash, responding to commands with distraction, introductions to new animals (chickens and goats), and chaotic interactions with groups of people and dogs.

Test C, the Test of the Canton of Basel-Stadt, Switzerland, is shorter than the others, but has similar conditions. Behavior is observed while the owner meets the assessor, walks the dog past kennels, unleashes the dog, introduces the dog to a stimulus dog, and then re-leashes and walks away. The owner’s behavior is observed and recorded, in addition to that of the dog.

The important measure for this study was the level of agreement between tests (inter-test reliability), rather than the performance of any of the tests on their own. Kappa (k) values were used to determine the level of agreement between tests, beyond what is expected by chance. Values of k between 0 and 1 are interpreted on an ordinal scale of agreement: 0-0.2 is slight agreement, 0.21-0.4 is fair agreement, 0.41-0.6 is moderate agreement, 0.61-0.8 is substantial (high) agreement, and 0.81-1 is very high agreement.
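To make the k (kappa) statistic concrete, here is a minimal Python sketch of chance-corrected agreement between two tests. It assumes Cohen's kappa for a pair of tests; the summary does not state which kappa variant the authors used, and the scores below are invented for illustration, not data from the study. The function names `cohen_kappa` and `agreement_label` are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length lists of categorical scores."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: share of dogs given the same score by both tests.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement by chance, from each test's own score frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both tests gave every dog the same single score; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def agreement_label(kappa):
    """Map kappa onto the bands quoted in the text above."""
    if kappa < 0:
        # Assumption: the text gives no band for values below zero.
        return "less than chance"
    if kappa <= 0.2:
        return "slight"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "very high"


# Invented scores on the 1-5 scale for ten hypothetical dogs.
test_a = [1, 1, 1, 2, 3, 1, 1, 2, 1, 1]
test_b = [1, 1, 2, 1, 3, 1, 1, 1, 1, 2]
kappa = cohen_kappa(test_a, test_b)
print(round(kappa, 3), agreement_label(kappa))
```

In this invented example the two tests give the same score to 6 of 10 dogs, but because most dogs score 1 on both tests, chance alone would produce agreement on 54% of dogs, so kappa comes out near 0.13 ("slight"). This illustrates how raw agreement can look respectable while chance-corrected agreement stays low when one category dominates.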

There was high agreement across the three tests for “open, friendly, neutral behavior.” Most dogs received this score on all three tests. Overall, there was only slight agreement between the three tests for intra-specific behavior (k = 0.133, P = 0.014) and inter-specific behavior (k = 0.135, P = 0.014). Because so few dogs in the sample showed any aggression, the dogs that scored a 2 or higher on any test were grouped together under the label “potentially aggressive” and were further analyzed with pairwise comparisons. It is important to note that dogs who showed behaviors interpreted as fear or mistrust, but not behaviors attributed by the assessors to aggression, were included in this grouping. Inter- and intra-species behavior analyses remained separate. Despite small sample sizes (n = 23 and 29 for inter-specific and intra-specific behavior, respectively), the data showed significant differences in the results between tests A and B (P = 0.035) and between tests B and C (P < 0.001), while the results were inconclusive between tests A and C. The three commonly used tests were not reliable in predicting inter-specific aggression (aggression towards humans).

At first glance the simplified scoring system might appear to be recording opinion based on observation rather than objective conclusions from recorded behaviors. However, the purpose was to determine if the three tests (which have the same aim of identifying aggressive dogs) could reliably produce behavior that would be labeled similarly by different observers. Adequate controls were in place: no assessor was present during another assessor’s test, and none were aware of the other assessors’ scores. Furthermore, a Latin square design was used to control for order effects, and owners were not given feedback until they had completed all three tests. The assessors were experts with respect to their tests and would be the individuals typically evaluating dogs outside of the experiment. Finally, the adopted scoring system allows for direct comparisons between tests, which would not otherwise be possible. Thus, the researchers’ goal of assessing the three tests for their ability to detect potential aggression was accomplished.
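As an illustration of how a Latin square can balance test order, the following Python sketch builds a cyclic 3 x 3 square in which each test appears once in each position. The summary does not describe the exact square or assignment procedure the authors used, so the cyclic construction, the round-robin assignment, and the names `latin_square_orders` and `assign_orders` are assumptions for illustration only.

```python
def latin_square_orders(conditions):
    """Cyclic Latin square: row r is the condition list rotated left by r."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)]
            for row in range(n)]


def assign_orders(participants, conditions):
    """Assign participants to the rows of the square in round-robin fashion."""
    orders = latin_square_orders(conditions)
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}


# Hypothetical dog identifiers; the study's actual assignment is not given.
dogs = [f"dog_{i:02d}" for i in range(1, 7)]
for dog, order in assign_orders(dogs, ["A", "B", "C"]).items():
    print(dog, "->", " then ".join(order))
```

With three tests this yields the orders A-B-C, B-C-A, and C-A-B, so no single test is always taken first, second, or last. A cyclic square balances position only; it does not balance which test immediately precedes which.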

The lack of inter-test reliability revealed by this study strongly supports the authors’ conclusion that such tests lack validation, and it calls into question their use in making decisions about how individual dogs may be kept.

Abstract and Link to Purchase Full Text of the Original Article: