Content Validity & Reliability Procedure

CAEP Criteria for Evaluation of EPP-Created Assessments and Surveys
Before you proceed with the content validity and reliability procedures, please view the CAEP Criteria for Evaluation of EPP-Created Assessments and Surveys.


The College of Education and Professional Development (COEPD) at Marshall University has established a content validity procedure for all Educator Preparation Provider (EPP)-created assessments and surveys, including key assessments, performance tasks, clinical evaluations, and national board-certified exams.  The EPP adopted the procedure for evaluating its assessments in Spring 2022.  The content validity and reliability procedures are used by both initial- and advanced-level programs.  The procedures follow the guidelines outlined in the CAEP Evaluation Framework for EPP-Created Assessments to design, pilot, and judge the adequacy of assessments created by the EPP.

The purpose of the content validity procedure is to provide guidance for collecting evidence and to document the adequate technical quality of assessment instruments and rubrics used to evaluate candidates in the COEPD.

CAEP Defined Assessments

CAEP uses the term “assessments” to cover content tests, observations, projects or assignments, and surveys – all of which are used with candidates.  Surveys are often used to gather evidence on candidate preparation and candidate perceptions about their readiness to teach.  Surveys are also helpful to measure the satisfaction of graduates or employers with preparation and the perceptions of clinical faculty about the preparedness of EPP completers.

Assessments and rubrics are used by faculty to evaluate candidates and provide them with feedback on their performance.  Assessments and rubrics should address relevant and meaningful candidate knowledge, performance, and dispositions aligned with CAEP standards.  An EPP uses the assessments that make up the evidence offered in accreditation self-study reports to examine candidates consistently at various points from admission through completion.  These are assessments that all candidates are expected to complete as they pass from one stage of preparation to the next, or that are used to monitor candidates’ developing proficiencies during one or more stages of preparation.

EPP-Defined Assessment

The definition of assessment adopted by the EPP includes three significant processes: data collection from a comprehensive and integrated set of assessments, analysis of data for forming judgments, and use of analysis in making decisions.  Based on these three processes, assessment is operationally defined as a process in which data/information is collected, summarized, and analyzed as a basis for forming judgments.  Judgments then form the basis for making decisions regarding continuous improvement in our programs.

EPP Five Year Review Cycle

The EPP established a consistent process to review all EPP-created assessments/rubrics on a five-year cycle when possible.

Content Validity Procedure

The COEPD will assess the validity of an assessment in two ways.  First, the EPP will calculate Lawshe’s Content Validity Ratio (Step 2: CVR) for each item of an assessment.  Second, the EPP will assess construct validity (Step 3: Construct Validity) by soliciting feedback from external content experts about the assessment’s constructs and their related items.

The EPP faculty will use the following procedure to ensure the validity of EPP-Created Assessments:

Whether you’re creating a new assessment or reviewing a current one, begin by convening a small working group to create or review the assessment.  Ideally, the group should consist of 5-7 individuals, including faculty and at least one external content expert.  The working group will do the following:

  1. Identify Performance Domains: A performance domain (domain) can often be thought of as your standard (e.g., Commitment to Students).  It’s okay to use your standard as your domain.
  2. Provide an operational definition for your domain (e.g., Commitment to Students: the creation of a learning environment and community to promote successful teaching and learning).
  3. Compile Initial Rubric Items:

Q-Methodology is a card-sort technique designed to study subjectivity (views, opinions, beliefs, values, etc.) and is used here to identify the essential components of an assessment.  For most of our purposes, these essential components are the items shown in the picture below.  If the Q-sort is to be conducted electronically, Qualtrics’s Pick, Group, and Rank question type is a valuable tool.

  1. Identify a Q-Sort Group that includes a mixture of COEPD faculty members, students, and external experts (classroom teachers, supervisors, etc.) to identify overarching constructs, the constructs’ operational definitions, and assessment items.
  2. Once the overarching constructs and items have been identified, identify a panel of experts to whom they will be distributed.
  3. The panel of experts should include content experts from outside the college, with no fewer than 15 people per panel.  Minimum credentials for each expert should be established by consensus of program faculty, and those credentials should bear up to reasonable external scrutiny.
  4. Use Qualtrics to distribute a short survey to the panel of experts that includes your overarching constructs, their operational definitions, and the assessment items.  Ask your panel to drag each assessment item into one of three categories:
    1. Essential: The assessment item is essential to the overarching construct.
    2. Useful but Not Essential: The assessment item is useful, but not essential to the overarching construct.
    3. Not Necessary: The assessment item is not necessary for the overarching construct.

Virtual Q-Sort Screenshot
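Once the electronic Q-sort closes, the exported responses can be tallied per item before any CVR arithmetic begins.  A minimal Python sketch (item names and panelist responses below are hypothetical, not from an actual COEPD Q-sort):

```python
from collections import Counter

# Hypothetical exported Q-sort responses: item -> one category choice per panelist
responses = {
    "Creates a positive learning environment": [
        "Essential", "Essential", "Useful but Not Essential",
        "Essential", "Not Necessary", "Essential",
    ],
}

for item, votes in responses.items():
    tally = Counter(votes)
    print(f"{item}: {dict(tally)}")
    # -> Creates a positive learning environment:
    #    {'Essential': 4, 'Useful but Not Essential': 1, 'Not Necessary': 1}
```

The "Essential" count for each item feeds directly into the CVR formula described below.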

Once you receive feedback from the expert panel, use Lawshe’s Content Validity Ratio (CVR) to calculate a proportional level of agreement.  Lawshe’s CVR is calculated for each assessment item using the formula (ne – N/2)/(N/2), where ne represents the number of panelists who rated the item as essential and N represents the total number of panel members.

    1. Use the CVR Chart to identify the number of panelists and the corresponding CVR critical value.
    2. Compare each item’s calculated ratio with the critical value from the CVR Chart to determine whether the item meets the minimum CVR required, given the number of panelists.

Example of Using Excel to Find CVR
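The same CVR arithmetic can be scripted if Excel is not convenient.  The sketch below is a hypothetical Python illustration; the critical value shown is the commonly tabled Lawshe value for 15 panelists, so confirm it against the CVR Chart:

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR: (ne - N/2) / (N/2)."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Example: 12 of 15 panelists rated an item "Essential"
cvr = content_validity_ratio(12, 15)
print(round(cvr, 2))   # 0.6
print(cvr >= 0.49)     # True -> meets the tabled critical value for N = 15
```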

Once Lawshe’s CVR has been calculated, retain only the assessment items that meet the CVR critical value.  You can now create your assessment rubric.  Using the same expert panel as before, create an Assessment Packet for the Panel of Experts.  The packet should include:

    1. A letter explaining the purpose of the study, the reason the expert was selected, a description of the measure and its scoring, and an explanation of the response form.
    2. A copy of the assessment instructions distributed to candidates.
    3. A copy of the rubric distributed to candidates and used to evaluate the assessment.
    4. The response form aligned with the assessment/rubric for the panel members to rate each item.

The Response form can be found in the COEPD Resources Team > Assessment > Content Validity.

The Response Form for each EPP-Created Assessment should be completed by the panel members from Procedure #3, asking them to rate the items that appear on the rubric.  Program faculty should work collaboratively to develop the response form required for each rubric officially used to evaluate candidate performance.

    1. The overarching construct that the item purports to measure should be identified and operationally defined for each item.
    2. The item should be written as it appears on the assessment.
    3. Experts should rate the item’s level of representativeness in measuring the aligned overarching construct on a scale of 1-4, with 4 being the most representative. Space should be provided for experts to comment on the item or suggest revisions.
    4. Experts should rate the importance of the item in measuring the aligned overarching construct on a scale of 1-4, with 4 being the most important. Space should be provided for experts to comment on the item or suggest revisions.
    5. Experts should rate the item’s level of clarity on a scale of 1-4, with 4 being the most clear and 1 being not clear. Space should be provided for experts to comment on the item or suggest revisions.

A Content Validity Index (CVI) will be calculated to ensure items are considered acceptable by the panel of experts. A CVI of .80 or higher is deemed acceptable.

CVI = (number of experts who rated the item as 3 or 4) / (total number of experts)
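As a hypothetical illustration, the CVI for a single item can be computed directly from the experts’ 1-4 ratings (the ratings below are invented for the example):

```python
def content_validity_index(ratings) -> float:
    """CVI for one item: proportion of experts rating it 3 or 4 on a 1-4 scale."""
    return sum(1 for r in ratings if r >= 3) / len(ratings)

ratings = [4, 3, 4, 2, 4, 3, 4, 4, 3, 4]  # ten hypothetical expert ratings
cvi = content_validity_index(ratings)
print(cvi)          # 0.9
print(cvi >= 0.80)  # True -> item is acceptable
```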

A facilitator prepares materials for scorers to begin calibrating the assessment rubric. Materials include the assessment instructions, the grading rubric, and a student artifact.

Using the rubric, scorers read the assessment instructions, view the student artifact, and score the artifact using the assessment rubric. Scorers should note the words and phrases in the performance descriptors that best describe the quality of the work.

One at a time, scorers share their scores for each rubric category while a recorder completes a group score sheet. Scorers should not explain their scores at this point. Once all scores are shared and recorded, the scorers discuss differences in the scores, where the differences occurred, and why scorers may have evaluated the artifact differently.  Scorers justify their evaluations by pointing to specific language in the rubric and evidence in the student artifact. The group discusses each piece of student work and resolves issues that may arise from the rubric language or the evidence in the student artifact, until scorer consensus is reached.

During the calibration process, the facilitator uses the group score sheet to check inter-rater reliability (IRR).  IRR is the extent to which two or more raters agree; it addresses how consistently a rating system is applied.
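The procedure does not prescribe a particular IRR statistic.  One simple starting point is pairwise percent agreement, sketched below with hypothetical calibration scores; more formal statistics such as Cohen’s kappa additionally correct for chance agreement:

```python
from itertools import combinations

def percent_agreement(scores_by_rater) -> float:
    """Pairwise percent agreement: fraction of (rater pair, item) comparisons
    in which both raters assigned the same rubric score."""
    agree = total = 0
    for a, b in combinations(scores_by_rater, 2):
        for x, y in zip(a, b):
            total += 1
            agree += (x == y)
    return agree / total

# Hypothetical rubric scores for four items from three calibration scorers
rater_a = [3, 4, 2, 3]
rater_b = [3, 4, 3, 3]
rater_c = [3, 3, 2, 3]
print(round(percent_agreement([rater_a, rater_b, rater_c]), 2))  # 0.67
```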