Evaluation Framework for using GenAI to Create Formative Assessment

Jun 27, 2025

—

Note: This article was originally written for an ETEC 565 class.

The evolution of AI models has enabled faster development of high-quality Open Educational Resources (OER). Traditionally, creating OER has been time-consuming due to the effort required to address licensing issues, as well as the labor and expertise needed for content creation. With the advancement of AI tools, Open Education scholars such as Wiley (2023) suggest that these technologies can significantly reduce both the time and cost involved in developing OER. Wiley (2023) also provides several examples of using ChatGPT to quickly generate drafts for formative assessments.

At the Teaching and Learning department where I work, we use a tool called H5P to create ungraded formative assessments embedded in open courses or open textbooks. H5P enables the creation of interactive HTML5-based learning objects, such as multiple-choice questions, interactive videos, and paragraph-sorting tasks (H5P, n.d.). Koedinger et al. (2016) found that students who complete multiple interactive practice activities tend to achieve higher learning outcomes.

Therefore, I have developed the framework below to help assess the quality of AI-generated assessments:

Framework

Accuracy / Relevance:

Is the content of the formative assessment accurate?
Is it relevant to and appropriate for the target audience?
Is the content creative and engaging, or does it feel generic?

Pedagogy:

Does the assessment follow best practices in learning design (e.g., avoids trick questions, includes fair but challenging distractors)?
Does it fulfill the intended learning objectives or goals?
Can students access the assessment without additional cost?

Ethics:

Are there any biases in the content (e.g., gender, ethnicity)?
Does it reinforce harmful stereotypes?
Does it violate principles of Indigenous Data Sovereignty?
Does the creation process require student data or confidential information?
Does the output infringe on intellectual property rights?

Sustainability / Practicality:

How many prompts or iterations were needed to generate usable content?
Could similar quality be achieved with lower environmental impact or simpler tools?
Is it possible to complete the work without using AI?
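
The framework above can also be kept as a simple checklist while reviewing AI-generated output. The following is a minimal sketch in Python, not part of the original workflow: the names `Criterion`, `FRAMEWORK`, `new_review`, and `unreviewed` are hypothetical, and the `ok` field is an assumption introduced here so that each question can be marked as satisfactory or as a concern regardless of how the question is worded.

```python
from dataclasses import dataclass
from typing import Optional

# The four categories and their questions, copied from the framework above.
FRAMEWORK = {
    "Accuracy / Relevance": [
        "Is the content of the formative assessment accurate?",
        "Is it relevant to and appropriate for the target audience?",
        "Is the content creative and engaging, or does it feel generic?",
    ],
    "Pedagogy": [
        "Does the assessment follow best practices in learning design?",
        "Does it fulfill the intended learning objectives or goals?",
        "Can students access the assessment without additional cost?",
    ],
    "Ethics": [
        "Are there any biases in the content (e.g., gender, ethnicity)?",
        "Does it reinforce harmful stereotypes?",
        "Does it violate principles of Indigenous Data Sovereignty?",
        "Does the creation process require student data or confidential information?",
        "Does the output infringe on intellectual property rights?",
    ],
    "Sustainability / Practicality": [
        "How many prompts or iterations were needed to generate usable content?",
        "Could similar quality be achieved with lower environmental impact or simpler tools?",
        "Is it possible to complete the work without using AI?",
    ],
}


@dataclass
class Criterion:
    """One framework question plus the reviewer's judgement (hypothetical structure)."""
    question: str
    ok: Optional[bool] = None  # None = not reviewed yet, True = satisfactory, False = concern
    note: str = ""


def new_review():
    """Create a blank review with one Criterion per framework question."""
    return {category: [Criterion(q) for q in questions]
            for category, questions in FRAMEWORK.items()}


def unreviewed(review):
    """List (category, question) pairs that have not been judged yet."""
    return [(category, c.question)
            for category, criteria in review.items()
            for c in criteria
            if c.ok is None]


review = new_review()
review["Ethics"][1].ok = True
review["Ethics"][1].note = "No stereotypes found in the case studies."
print(len(unreviewed(review)), "questions still to review")
```

Recording a short note next to each judgement keeps the review transparent, which fits the accountability goal of the framework.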

I took the concept of accountability into consideration when developing the above framework. AI systems often operate as “black boxes,” obscuring the processes that produce their outputs and complicating efforts to assign responsibility (Coleman, 2021). The framework is designed to help educators use AI tools in a more responsible and transparent way. In designing it, I also referred to the University of Waterloo’s tip sheet on designing multiple-choice questions for the Pedagogy component, and to UBC’s guidelines on the use of GenAI in teaching and learning for the Ethics component.

Testing the Framework

To test this framework, I used ChatGPT-4o to draft multiple-choice quiz questions that I could create in H5P and embed in an open course I am developing. The quiz serves as an ungraded formative assessment. The topic is Creative Commons licenses, which are commonly used in Open Educational Resources (OER). To begin, I gave the following prompt to ChatGPT-4o:

“Can you create a multiple-choice quiz on Creative Commons licenses? This is a formative, ungraded assessment. To make the questions more engaging, please include a short case study within the question.”

Below is the output generated by ChatGPT-4o:

“Question 1: Choosing the Right License for Remixing

Case Study:
Liam creates an educational video about climate change using publicly available animations and images. He wants to allow others to remix his video for non-commercial purposes, as long as they credit him and share their versions under the same terms.

Which Creative Commons license should Liam choose?

A. CC BY
B. CC BY-NC
C. CC BY-NC-SA
D. CC BY-NC-ND

✅ Correct Answer: C. CC BY-NC-SA
💡 Explanation: Liam wants to allow remixing (so ND is not suitable), restrict commercial use (NC), and require share-alike (SA).

Question 2: Using a Work with Restrictions

Case Study:
Sofia found a blog post licensed under CC BY-ND and wants to translate it into another language to share with her students.

Can Sofia legally do this under the current license?

A. Yes, as long as she credits the author.
B. Yes, if she shares the translation under the same license.
C. No, because translations are considered derivative works.
D. No, unless she makes it available for free.

✅ Correct Answer: C. No, because translations are considered derivative works.
💡 Explanation: “ND” (No Derivatives) prohibits modifications, including translations.

Question 3: Combining Open Resources

Case Study:
Jin is creating a course using two CC-licensed videos:

Video A: CC BY-SA
Video B: CC BY-NC

He wants to combine them into a single edited video.

Can he do this while complying with both licenses?

A. Yes, both licenses are open, so it’s allowed.
B. No, because SA and NC are incompatible in this case.
C. Yes, but only if he removes the NC requirement.
D. Yes, as long as he credits both sources.

✅ Correct Answer: B. No, because SA and NC are incompatible in this case.
💡 Explanation: ShareAlike (SA) requires derivatives to carry the same license, but CC BY-SA and CC BY-NC can’t be legally combined due to conflicting terms.

Question 4: Attribution Requirements

Case Study:
Maya uses a Creative Commons image in her presentation and adds a simple line at the bottom:
Image from Wikimedia Commons

Is this attribution sufficient under a standard CC BY license?

A. Yes, as long as the platform is mentioned.
B. No, she must include the creator’s name, license type, and a link.
C. Yes, because it’s educational use.
D. No, unless she also edits the image.

✅ Correct Answer: B. No, she must include the creator’s name, license type, and a link.
💡 Explanation: Proper attribution under CC requires: Title, Author, Source, and License (TASL).

Question 5: Commercial Use Clarity

Case Study:
An entrepreneur wants to use a CC BY-NC photo in a free brochure promoting their paid online course.

Does this count as commercial use under CC definitions?

A. No, because the brochure is free.
B. Yes, because it indirectly promotes a paid service.
C. No, since the license allows educational use.
D. Yes, but only if they don’t make a profit.

✅ Correct Answer: B. Yes, because it indirectly promotes a paid service.
💡 Explanation: Commercial use includes indirect profit-making, such as marketing a product.

Would you like this exported into an H5P-compatible format or embedded into a Google Form?

” (OpenAI, 2025)

Analysis Based on the Framework

Analyzing the above output:

Accuracy

In terms of accuracy, the content of the formative assessment is accurate. Although I did not specify the target audience in my prompt, I intended the assessment for students in higher education who are new to Creative Commons licenses. The characters in the questions reflect that audience: they are mainly students and instructors.

Although I didn’t include it in the original prompt, the output also provided an explanation for why each answer is correct. The explanations are generally well written. However, if I were to suggest an improvement, I would expand the feedback for question 5 to clarify that “non-commercial” refers to the purpose of use, not the type of user or organization.

Pedagogy

In terms of pedagogy, the quality of the questions is excellent. The goal of this formative assessment is to help students understand the basic concepts of Creative Commons. Overall, the quiz covers foundational concepts such as the different types of Creative Commons licenses and best practices for attributing and combining open resources.

I also value the inclusion of questions on non-commercial use and remixing, since these are nuanced topics that often confuse learners.

Bias

In terms of bias, the names used in the questions are ethnically diverse and include a mix of English and non-English names. I also do not see harmful stereotypes, or anything that violates Indigenous data sovereignty or intellectual property rights. In addition, the output does not contain any private information.

Sustainability

In terms of sustainability, I used only one prompt to generate the quiz. Based on the energy-use estimate in Aljbour et al. (2024) of 2.9 Wh per prompt, generating the quiz used roughly 2.9 Wh.
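
The arithmetic behind this estimate is simply the number of prompts multiplied by the per-prompt figure. As a rough sketch, assuming the 2.9 Wh estimate applies equally to every prompt (the function name `estimated_energy_wh` is hypothetical and introduced only for illustration):

```python
# Per-prompt estimate taken from Aljbour et al. (2024); treated here as a
# constant for every prompt, which is a simplifying assumption.
WH_PER_PROMPT = 2.9


def estimated_energy_wh(num_prompts, wh_per_prompt=WH_PER_PROMPT):
    """Estimate total energy in watt-hours for a number of prompts."""
    if num_prompts < 0:
        raise ValueError("num_prompts must be non-negative")
    return num_prompts * wh_per_prompt


# One prompt was used for the quiz in this post.
print(f"{estimated_energy_wh(1):.1f} Wh")
```

The same function can be used to compare sessions that need more iterations, since each additional prompt adds the same estimated amount.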

I could probably have crafted similar multiple-choice questions myself, but looking at the output, it would have taken me more time to draft them. As a result, I would likely have used more energy if I had created everything from scratch.

Additional Analysis

Reflecting on the framework, the assessment generated by ChatGPT-4o is surprisingly high quality. I suspect the two main reasons for this are personalization and the availability of relevant material. I noticed that, even though I did not specify it in the prompt, the characters portrayed in the multiple-choice questions were primarily students and faculty. Additionally, ChatGPT-4o offered to output the quiz in H5P format without being prompted. Therefore, I suspected that ChatGPT-4o might have referred to my past interactions, similar to what I had written in a previous blog post. To confirm this, I asked whether it had accessed any past conversations or used memory. The response I received is shown below:

“Contextual signals from your prompt
Phrases like “formative,” “case study,” and “Creative Commons” strongly suggest open education or higher ed contexts.

General trends in open pedagogy

H5P is one of the most widely adopted open tools for creating ungraded interactive content, especially in:
Pressbooks (often used in OER)
Moodle / LMS integrations
Canadian and European universities

So I don’t assume everyone uses it — but in absence of further detail, I pick a tool that:

Is free and open-source
Supports ungraded quizzes
Is popular among educators who care about Creative Commons

That’s why I suggested H5P, but as an optional export, not a core assumption.

” (OpenAI, 2025)

I was extremely surprised by how well ChatGPT-4o was able to detect patterns and understand context based on the subtle keywords I included in the prompt. It reminded me of an article I read recently, which discussed how prompt engineering, a popular job in 2023, has become largely obsolete due to improvements in large language models (LLMs). These models now recognize context more effectively and rely less on carefully crafted prompts (Bousquette, 2025).

In addition to ChatGPT-4o’s pattern recognition abilities, another factor that influenced the quality of the output is the availability of source data. Most educational resources about Creative Commons licenses are publicly accessible and carry a CC BY license, which allows them to be reused. The result could have been very different if the topic I had chosen were niche or hidden behind a paywall.

References

Aljbour, J., Wilson, T., & Patel, P. (2024). Powering Intelligence: Analyzing Artificial Intelligence and Data Center Energy Consumption. EPRI White Paper No. 3002028905. https://www.epri.com/research/products/000000003002028905

Anthropic (Director). (2025, June 12). Lesson 4: A closer look at Delegation | AI Fluency: Framework & Foundations Course [Video recording]. https://www.youtube.com/watch?v=EljzyfdYkrc

Bousquette, I. (2025, April 25). The Hottest AI Job of 2023 Is Already Obsolete. Wall Street Journal. https://www.wsj.com/articles/the-hottest-ai-job-of-2023-is-already-obsolete-1961b054

H5P. (n.d.). Getting started. Retrieved June 21, 2025, from https://h5p.org/getting-started

Koedinger, K. R., McLaughlin, E. A., Jia, J. Z., & Bier, N. L. (2016). Is the doer effect a causal relationship? How can we tell and why it’s important. Proceedings of the Sixth International Conference on Learning Analytics & Knowledge, 388–397. https://doi.org/10.1145/2883851.2883957

OpenAI. (2025). ChatGPT (GPT-4o version) [Large language model]. https://chat.openai.com/chat

Sun, Z., Du, X., Song, F., Wang, S., Ni, M., Li, L., & Lo, D. (2024). Don’t Complete It! Preventing Unhelpful Code Completion for Productive and Sustainable Neural Code Completion Systems. ACM Trans. Softw. Eng. Methodol., 34(1), 21:1-21:22. https://doi.org/10.1145/3688831

University of British Columbia. (n.d.). Guidelines for all uses of GenAI in Teaching & Learning. Generative AI. Retrieved June 21, 2025, from https://genai.ubc.ca/guidance/teaching-learning-guidelines/guidelines-for-all-uses-of-genai-in-teaching-learning/

University of Waterloo. (n.d.). Designing Multiple-Choice Questions | Centre for Teaching Excellence | University of Waterloo. Retrieved June 21, 2025, from https://uwaterloo.ca/centre-for-teaching-excellence/catalogs/tip-sheets/designing-multiple-choice-questions

Wiley, D. (2023, January 23). AI, instructional design, and OER. OpenContent. https://opencontent.org/blog/archives/7129

Winterson, J. (2022). 12 Bytes: How We Got Here. Where We Might Go Next. Grove Press.

Disclaimer

Although all ideas are my own and sources are properly cited, this post has been reviewed using ChatGPT-4o to improve grammar, sentence structure, and citation formatting. In addition, I used ChatGPT-4o to refine the framework I created. I have double-checked the revised paragraph to ensure accuracy in both content and formatting. An example of my interaction with ChatGPT-4o for this article can be accessed here.

This work by Rie Namba is licensed under CC BY 4.0
