Assessing Reliability of Expert Ratings Among Judges Responding to a Survey Instrument Developed to Study the Long Term Efficacy of the ABET Engineering Criteria, EC2000
Date of Award
Thesis - Open Access
Master of Science in Human Factors & Systems
Human Factors and Systems
Shawn Doherty, Ph.D.
Elizabeth L. Blickensderfer, Ph.D.
Rosemarie Reynolds, Ph.D.
In today’s assessment processes, especially those evaluations that rely on humans to make subjective judgements, it is necessary to analyze the quality of their ratings. The psychometric issues associated with assessment provide the lens through which researchers interpret results and important decisions are made. Therefore, inter-rater agreement (IRA) and inter-rater reliability (IRR) are prerequisites for rater-dependent data analysis. A survey instrument cannot provide “good” information if it is not reliable; in other words, reliability is central to the validation of an instrument. When judges cannot be shown to reliably rate a performance, item, or target, the question becomes why the judges’ responses are different from one another. If the judges’ ratings covary unreliably because the construct is poorly defined or the rating framework is defective, then the resultant scores will have questionable meaning. On the other hand, if the judges’ ratings differ because they have a true difference in opinion, this is of importance to the researcher and may not necessarily diminish the validity of the scores. The intraclass correlation coefficient (ICC) is the most efficient method to assess these rater differences and identify the specific sources of inconsistency in measurement. This study examined how ICCs can be used to inform researchers of the extent to which legitimate differences of opinion may appear as a lack of reliability and/or agreement, demonstrating the need for analyzing survey data beyond standard descriptive statistics. Overall, both the IRA and IRR correlations, as calculated by ICC, ranged from .79 to .91, indicating high levels of agreement and consistency in the scoring among the judges’ ratings. When group membership was accounted for, the IRA values increased, suggesting that the common judges agreed more than those judges who varied in their perspectives.
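The distinction the abstract draws between agreement (IRA) and consistency (IRR) can be illustrated with a minimal sketch. The abstract does not state which ICC forms the thesis used, so the code below assumes the common two-way, single-rating formulas for absolute agreement and for consistency; the function name `icc_two_way` and the ratings in `demo` are hypothetical and are not data from the study.

```python
def icc_two_way(ratings):
    """Return (agreement, consistency) single-rating ICCs.

    ratings: list of rows, one row per rated item, one score per judge.
    Assumes a complete table (every judge rated every item).
    """
    n = len(ratings)          # number of rated items
    k = len(ratings[0])       # number of judges
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items, judges, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-item mean square
    msc = ss_cols / (k - 1)               # between-judge mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    # Consistency ignores systematic judge offsets; agreement penalizes them.
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return agreement, consistency


# Hypothetical ratings: the second judge always scores two points higher.
demo = [[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]]
demo_agreement, demo_consistency = icc_two_way(demo)
```

In this made-up example the two judges rank the items identically, so the consistency value is 1, while the constant two-point offset pulls the agreement value down to 5/9. A systematic difference between judges therefore lowers agreement without lowering consistency, which is one reason to report both.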
Scholarly Commons Citation
Litzinger, Tracy L., "Assessing Reliability of Expert Ratings Among Judges Responding to a Survey Instrument Developed to Study the Long Term Efficacy of the ABET Engineering Criteria, EC2000" (2006). Master's Theses - Daytona Beach. 123.