TY - GEN
T1 - Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human-in-the-Loop
AU - Afzal, Anum
AU - Kowsik, Alexander
AU - Fani, Rajna
AU - Matthes, Florian
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Large Language Models have found application in various mundane and repetitive tasks, including Human Resource (HR) support. We worked with the domain experts of SAP SE to develop an HR support chatbot as an efficient and effective tool for addressing employee inquiries. We inserted a human-in-the-loop at various stages of the development cycle, such as dataset collection, prompt optimization, and evaluation of generated output. By enhancing the LLM-driven chatbot’s response quality and exploring alternative retrieval methods, we have created an efficient, scalable, and flexible tool for HR professionals to address employee inquiries effectively. Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data through internal reasoning capabilities. Additionally, through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.
AB - Large Language Models have found application in various mundane and repetitive tasks, including Human Resource (HR) support. We worked with the domain experts of SAP SE to develop an HR support chatbot as an efficient and effective tool for addressing employee inquiries. We inserted a human-in-the-loop at various stages of the development cycle, such as dataset collection, prompt optimization, and evaluation of generated output. By enhancing the LLM-driven chatbot’s response quality and exploring alternative retrieval methods, we have created an efficient, scalable, and flexible tool for HR professionals to address employee inquiries effectively. Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data through internal reasoning capabilities. Additionally, through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.
UR - http://www.scopus.com/inward/record.url?scp=105000823923&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:105000823923
T3 - DaSH 2024 - Data Science with Human-in-the-Loop, Proceedings of the DaSH Workshop at NAACL 2024
SP - 4
EP - 16
BT - DaSH 2024 - Data Science with Human-in-the-Loop, Proceedings of the DaSH Workshop at NAACL 2024
A2 - Dragut, Eduard
A2 - Li, Yunyao
A2 - Popa, Lucian
A2 - Vucetic, Slobodan
A2 - Srivastava, Shashank
PB - Association for Computational Linguistics (ACL)
T2 - 5th Workshop on Data Science with Human-in-the-Loop, DaSH 2024 at NAACL
Y2 - 20 June 2024
ER -