Reliability of assessment of dogs’ behavioural responses by staff working at a welfare charity in the UK

To best understand this article in the context of the behavior evaluation literature, please see National Canine Research Council’s complete analysis here.

Article Citation:

Diesel, G., Brodbelt, D., & Pfeiffer, D. U. (2008). Reliability of assessment of dogs’ behavioural responses by staff working at a welfare charity in the UK. Applied Animal Behaviour Science, 115, 171-181. doi:10.1016/j.applanim.2008.05.005

National Canine Research Council Summary and Analysis:

This 2008 study by Diesel, Brodbelt, and Pfeiffer assessed the inter-rater reliability and intra-rater repeatability of canine behavior evaluations among 40 staff members at Dogs Trust, the largest canine welfare charity in the UK. Test validity was not addressed in the study.

The authors designed a single protocol and assessment for use in the study, noting that assessments and scoring cards are inconsistent across the organization's many rehoming centers. Twenty dogs from one center were evaluated across three situations: being approached in their kennel, being groomed and handled, and encountering another dog. Six possible ratings were available in each context: “fear-aggressive,” “nervous,” “indifferent,” “well-behaved,” “excitable,” and “pushy-aggressive.” The trials were videotaped and sent to the 16 remaining Dogs Trust centers.

Not every dog was evaluated across all situations; 19 dogs were assessed for the kennel approach, 13 were assessed for grooming and handling, and 6 were assessed for meeting a conspecific. Similarly, not all raters were able to evaluate every dog.

Inter-rater reliability: comparing scores between observers

Entire sample

For the kennel approach condition, all behavioral characteristics had moderate agreement between raters except “indifferent,” which had poor agreement (kappa = 0.04). For general handling and grooming, there was moderate agreement for “nervous,” “well-behaved,” and “excitable” responses (kappa = 0.46, 0.26, and 0.43, respectively), but poor inter-rater reliability for “fear-aggressive” (kappa = 0.04) and “pushy-aggressive” (kappa = 0.03). Finally, when the subject was introduced to another dog, raters again had moderate agreement for “nervous,” “well-behaved,” and “excitable” responses (kappa = 0.28, 0.24, and 0.37, respectively), but poor reliability for “indifferent,” “fear-aggressive,” and “pushy-aggressive” (kappa = 0.16, 0.17, and 0.11, respectively).
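For context on what these numbers mean, the kappa statistic corrects raw percent agreement for the agreement expected by chance. The study's exact calculation may differ (multi-rater data are often analyzed with weighted or generalized kappa variants), but a minimal sketch of unweighted Cohen's kappa for two raters, using hypothetical binary ratings of ten dogs, looks like this:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of ten dogs (1 = "nervous", 0 = not nervous)
a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
print(round(cohens_kappa(a, b), 2))  # → 0.6
```

Here the two raters agree on 8 of 10 dogs (80% raw agreement), but because half that agreement is expected by chance, kappa is only 0.6; the same raw agreement over a more skewed label distribution would yield an even lower kappa, which is why kappa rather than percent agreement is the standard reliability measure.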

Trained sample

The researchers further examined a sub-sample of raters that included only those with formal behavior training or at least 8 years of field experience. Within this group, inter-rater reliability increased for all behavior measures except “well-behaved” during the general handling and grooming condition, which remained the same, and “nervous” when meeting another dog, which decreased slightly (kappa = 0.25).

Intra-rater repeatability: comparing scores from the same observers over time

Intra-rater repeatability was measured for 18 participants by having them re-score the same videos two months later. In general, repeatability was maintained over that period. For the approach-to-kennel and general handling and grooming scenarios, agreement was moderate or high for all behaviors. For meeting another dog, agreement was poor for “nervous,” “well-behaved,” and “pushy-aggressive” (kappa = 0.19, 0.11, and 0.24, respectively) and moderate for the other responses.


The results underscore the unreliability of behavior evaluations, particularly for behaviors that carry severe consequences for dogs. Agreement was lowest for “fear-aggressive” and “pushy-aggressive” assessments during handling and grooming, for example, both of which can condemn a dog as unadoptable. Moreover, there was poor agreement between raters on multiple behavioral characteristics, and also poor agreement within the same raters over time when assessing “nervous,” “well-behaved,” and “pushy-aggressive” behaviors when meeting another dog. This is particularly problematic because the raters were viewing not only the same dogs, but the exact same behaviors, as the sequences were videotaped.

Abstract and Link to Purchase Full Text of the Original Article: