EGITIM VE BILIM, vol.51, no.225, pp.227-264, 2026 (SSCI, Scopus, TRDizin)
This study examined whether responses generated by chatbots (ChatGPT-3.5, ChatGPT-4, and Bard) about heat and temperature match misconceptions identified in the literature and how these responses compare to those of learners. The study also addressed the effect of Conceptual Change Texts (CCTs) on chatbot-generated responses about heat and temperature, focusing on their relevance to prompt engineering. The Heat and Temperature Four-Tier Misconception Test (HTMCT) and the CCTs were adopted from a previous study that investigated the effectiveness of CCTs in remedying misconceptions about heat and temperature held by pre-service physics teachers. The HTMCT, consisting of 20 items, was designed to diagnose misconceptions about heat and temperature identified in the literature among pre-service physics teachers, with each misconception assessed by multiple items. In the present study, the HTMCT was used to diagnose the chatbots' responses about heat and temperature concepts before and after the implementation of CCTs. In addition, in-depth interviews with the chatbots were conducted to elaborate on their responses. Pre-service physics teachers in the prior study exhibited misconceptions about heat and temperature, which were effectively remediated by CCTs, leading to significant overall improvements. Similarly, this study found that chatbot-generated responses, except those from Bard, were prone to misconceptions. ChatGPT-4, unlike the other two chatbots, consistently generated responses that aligned with the scientific paradigm. However, pre- and post-test data revealed that ChatGPT-4 consistently exhibited one misconception, namely that equal amounts of heat supplied to different substances will result in the same final temperature. Both ChatGPT-3.5 and Bard showed improved performance between the pre- and post-tests, despite providing inconsistent responses.
While the chatbots could generate responses that accurately stated concept definitions, they struggled to draw conclusions based on multiple scientific concepts, to apply concepts to real-world scenarios, and to engage in complex reasoning. Although the algorithms underlying the chatbots remain undisclosed, the post-test responses of all chatbots showed a notable decrease in incorrect responses and improved alignment with scientific knowledge, suggesting a positive influence of the CCTs, consistent with the findings of the prior study.