Abstract
Following its initial release in November 2022, Chat GPT attracted one million subscribers within a week, foreshadowing its rapid integration into educational contexts. The platform equips both educators and students with powerful tools for the delivery and acquisition of knowledge. AI technology facilitates real-time feedback, personalized suggestions, and evaluations, evoking parallels with a personal tutor. Unsurprisingly, Study.com reports that 89% of its students have used AI for homework tasks. Despite the concerns held by educational professionals, AI has demonstrably increased student engagement and fostered greater learner autonomy. This article evaluates the use of AI technology by teachers in creating reading tasks and generating questions and answers, offering an overview of the potential advantages and disadvantages, along with the supporting rationale.
Keywords: Chat GPT, Reading Principles, CEFR Levels, Acquisition, Engagement, Autonomy, Automation, Graphs, Flesch-Kincaid, Question Generation, Coherence, Quality.
1. Background
1.1 Principles of Reading Comprehension
The development of a comprehension text can be a laborious task. Texts typically require two parts: a lengthy, coherent passage and diverse question types aligned with the content. Creation also necessitates taking into account a multitude of factors, including content relevance, reader interest, motivation, and organizational structure (Lipson & Wixson, as cited in Ahmad et al., 2017). Other scholars highlight the importance of addressing macro and micro skills, characterized by features that help readers identify main ideas, formulate generalizations, draw inferences, and reach conclusions (Brown 2004, as cited in Ahmad et al., 2017).
1.2 Creating Texts
When creating texts, Xiao et al. (2023) outline five major considerations. The first concerns readability. Students might be tasked with recognizing word groups, phonetic-graphemic associations, and word formation (Ahmad et al., 2017). The second factor is correctness. This relates to the logical presentation of ideas in a grammatically accurate way. A further consideration is coherence. Ideas need to be consistent and presented interestingly and engagingly. Lastly, text creators need to ensure questions align with the passage content and that answers are easily identifiable.
2. How Chat GPT works
Chat GPT employs two primary modalities for text generation. The first is what is defined as zero-shot learning (Xiao et al., 2023). It uses context-specific commands without the need for examples to generate a response. The second option is few-shot learning (ibid.). Here, the model can provide a much better response when furnished with example responses and relevant sources to draw from. Indeed, Xiao et al. (2023) found that few-shot texts generated by Chat GPT can surpass those originating from textbooks.
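The difference between the two modalities can be sketched as prompt construction for a chat-style API. This is a minimal illustration, not the article's actual prompts: the topic, example passages, and model name are placeholders, and the final API call is shown only as a comment.

```python
def zero_shot_prompt(topic, level="B2"):
    # Zero-shot: a bare instruction with no examples, only the task specification.
    return [{"role": "user",
             "content": f"Write a 600-word discursive article on {topic} "
                        f"at CEFR {level} level."}]

def few_shot_prompt(topic, examples, level="B2"):
    # Few-shot: the same instruction, preceded by reference texts to imitate.
    messages = [{"role": "system",
                 "content": "Imitate the style and structure of the example passages."}]
    for ex in examples:
        messages.append({"role": "user", "content": "Example passage:\n" + ex})
    messages.append({"role": "user",
                     "content": f"Now write a 600-word discursive article on "
                                f"{topic} at CEFR {level} level."})
    return messages

# Either message list would then be sent to the model, e.g.:
# client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
```

In practice, the few-shot list carries the reference material that, per Xiao et al. (2023), helps the output imitate the style and structure of the provided sources.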
Despite these capabilities, one of the biggest problems with AI-produced text is that individuals are often able to differentiate between human and AI-generated prose (ibid.). The identifiability of AI-generated text raises concerns regarding its suitability for educational settings. The goal, then, is to use AI-generated texts in a way that emulates human-generated text as closely as possible. Indeed, research has indicated that Chat GPT can effectively persuade readers of the human origins of certain texts, as it can successfully imitate the style and structure of provided reference material (ibid.). Nevertheless, it is important to incorporate mechanisms to confirm this emulation has been achieved.
3. Practical Example
The inspiration for this paper derives from the creation of a 600-word discursive-style article for Year 1 learners, exploring the topic of Learner Autonomy. Some ideas were presented visually in the form of a graph. The text was aimed at the B2 proficiency level of the CEFR and had a target FK score of 11. The CEFR (Common European Framework of Reference for Languages) is a standardized framework that provides a common basis for describing language proficiency across European languages, dividing proficiency into six levels ranging from A1 (beginner) to C2 (proficient). The Flesch-Kincaid (FK) readability tests are a pair of formulas designed to assess the ease of understanding of written English. The first, Flesch Reading Ease, scores how easy a text is to read: the higher the score, the easier the text. The second, the Flesch-Kincaid Grade Level, is based on US school grade levels and estimates the educational level a person needs in order to understand the text easily; a score of 8.0, for example, means an eighth-grader should be able to understand the text. Chat GPT 3.5 was used to generate a draft text, alongside associated questions, answers, and explanations. For the questions, a range of question types was required: vocabulary matching, identification of main ideas, sentence function, and True/False/Not Given style questions. Chat GPT was then asked to evaluate the text against the CEFR framework and to assign an FK readability score.
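The two Flesch-Kincaid formulas described above can be sketched in Python. The syllable counter here is a naive vowel-group heuristic (an assumption for illustration; production readability tools use dictionary-based syllable counts), so scores on real texts will differ slightly from commercial checkers.

```python
import re

def count_syllables(word):
    # Naive heuristic: count groups of consecutive vowels;
    # every word counts as at least one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(words, sentences, syllables):
    # Higher score = easier text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def fk_grade(words, sentences, syllables):
    # Maps difficulty onto US school grade levels.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def analyze(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = len(words)
    return {"words": n,
            "sentences": sentences,
            "reading_ease": flesch_reading_ease(n, sentences, syllables),
            "grade_level": fk_grade(n, sentences, syllables)}
```

Running `analyze()` over a draft gives an independent point of comparison against the model's self-reported FK score.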
4. Analysis
4.1 Automation & CEFR Levels
Previously, creating a text would have demanded a substantial investment of time and effort. The advent of AI has enabled educators to automate much of the work. However, care has to be taken to ensure the task meets the desired criteria. An effective approach to achieve this was to ask Chat GPT to evaluate its output against the CEFR framework (See Figure 1).
4.2 Coherence & Flesch-Kincaid
When asked to evaluate the generated text, Chat GPT concluded that ideas were presented coherently (Figure 1, Criterion 3). However, it seems to have a tendency to oversimplify ideas through the use of bullet points. In order to avoid this, it was instructed to re-write the text to make greater use of cohesive devices and linking words. This resulted in a revised text that was not only more cohesive but which more closely followed the norms of academic style.
Another problem relates to Flesch-Kincaid levels. Although Chat GPT was asked to produce text aimed at the FK11 level, its self-reported scores were inconsistent with external FK checkers. In this example, Chat GPT incorrectly identified the word count as 384 words, when the text was closer to 600 words. This skewed the values used in the FK formula calculation: while Chat GPT assigned a score of 9.6 (Figure 3a), other calculators scored it at 11.4 (see Figure 3b). Therefore, it is recommended that text writers carefully evaluate output to ensure it meets the required standards.
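The effect of a miscounted word total on the grade-level formula can be shown with a short calculation. The sentence and syllable counts below are illustrative assumptions chosen to demonstrate the skew, not measurements of the article's actual text: holding syllables-per-word constant, shrinking the word count from roughly 600 to 384 lowers the words-per-sentence term and drags the grade level down.

```python
def fk_grade(words, sentences, syllables):
    # Flesch-Kincaid Grade Level formula.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative counts only (47 sentences, ~1.86 syllables per word assumed):
correct = fk_grade(600, 47, 1119)  # ~11.4
skewed = fk_grade(384, 47, 716)    # ~9.6
```

A gap of nearly two grade levels from the word count alone is consistent with the discrepancy observed between Chat GPT's score and the external checkers.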
4.3 Graphical Analysis – Limitations and Potential Use Cases
A significant limitation of AI software is its inability to analyze graphical information, and support for this type of task remains uncommon. Some AI packages offer the feature as an additional service, while others omit it entirely. For example, Chat GPT 3.5 currently does not support this function (see Figure 2a), although it is supported in the paid version (otherwise known as Chat GPT Vision). Similarly, the function appears to be unsupported by XIPU AI 4 (see Figure 2b). Microsoft, however, does incorporate it within their Bing.com browser (otherwise known as 'Co-pilot'; see Figure 2c). This lack of functionality could even be seen as an advantage: designing assessments around tasks where AI is least effective may help reduce plagiarism issues in the future.
4.4 Question Creation
Chat GPT is perhaps at its most useful when creating questions. It was able to successfully create a range of questions according to the criteria with minimal changes. For example, it was asked to identify the ten most challenging words from the text and then provide definitions. However, certain definitions were considered too difficult for B2, and Chat GPT was asked to rewrite these for that level. For all other question types (Identifying Main Ideas, T/F/NG, Definitions, Sentence Function), it was able to successfully generate appropriate questions. Once complete, all questions were sanity-checked for their suitability and accuracy.
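Part of that sanity-checking step can itself be automated. The helper below is a hypothetical sketch (not a tool used in the study): it flags vocabulary items that do not actually appear in the passage and answer keys that fall outside the True/False/Not Given set, leaving judgments of suitability and difficulty to the human reviewer.

```python
def sanity_check(passage, vocab_items, tfng_answers):
    # Hypothetical helper: mechanical checks on generated questions.
    # Flags vocabulary items missing from the passage (simple substring
    # match) and answer keys outside the T/F/NG set.
    text = passage.lower()
    missing = [w for w in vocab_items if w.lower() not in text]
    invalid = [a for a in tfng_answers if a not in {"True", "False", "Not Given"}]
    return missing, invalid
```

Checks like these catch mechanical slips quickly, but they are no substitute for reviewing each question against the passage by hand.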
4.5 Answers and Rationale
Upon prompting, Chat GPT was able to provide not only the answers but also the rationale behind them. Sound rationale was provided for all but one question, which required human intervention to enhance coherence. This included adding ideas relating to hedging language and bolding important ideas for enhanced clarity.
5. Conclusion
Chat GPT offers a significant opportunity for automating the text creation process, achieving indistinguishability from human writing so long as it has been given appropriate guidance and fed appropriate models. While there are many advantages to using AI for text creation, one of the main benefits is its ability to evaluate what it has produced against a set of pre-defined criteria, helping humans confirm its usefulness. However, AI users do need to be critical of the feedback it provides, given its tendency to write in bullet form and its occasional lack of cohesive devices. Challenges also arise with its calculation of Flesch-Kincaid levels and the platform's inability to analyze visual data such as charts and graphs. Nevertheless, Chat GPT's strength lies in its ability to generate ideas as a starting point, and its most valuable function is the ability to automate the generation of questions, answers, and corresponding rationale. A further drawback is that not all AI features are freely available across all platforms, and some offer better functionality than others. In conclusion, while AI serves as a useful resource-creation tool, it must be used cautiously, with mechanisms in place to ensure appropriate standards are maintained.
References
Ahmad, M., Shakir, A., Aqeel, M. & Siddique, A.R. (2017), 'Principles for Devising a Reading Comprehension Test: A Library Based Review', Al-Qalam, December 2017. Available at: www.researchgate.net/publication/339939385 (Accessed: February 2024).
Xiao, C., Xu, S.X., Zhang, K., Wang, Y. & Xia, L. (2023), 'Evaluating Reading Comprehension Exercises Generated by LLMs: A Showcase of ChatGPT in Education Applications', Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 610–625, Association for Computational Linguistics. Available at: https://aclanthology.org/2023.bea-1.52.pdf (Accessed: February 2024).
Appendices
Figure 1: Chat GPT 3.5 Text Analysis Using CEFR Descriptors
Figure 2a: Limitations with Chat GPT 3.5 Using Graphs and Charts
Figure 2b: XIPU AI Version 4 Limitations with Graphs and Charts
Figure 2c: (Copilot) Chat GPT 4 (Microsoft) Graphical Analysis
Figure 3a: Errors with Chat GPT 3.5 FK Formula Calculation
Figure 3b: Same text analyzed by an external FK Calculator