Artificial intelligence (AI) has emerged as a transformative force in healthcare, promising to reshape patient care and clinical decision-making. The effectiveness of AI models, however, hinges on the quality and diversity of the data they are trained on. Traditional models rely predominantly on electronic health records (EHRs), which, while rich in structured clinical data, largely omit the nuanced interactions between patients and clinicians. The result is AI systems that may have an incomplete picture of the patient experience.
The Need for Multimodal Data
Recognizing this limitation, researchers have embarked on initiatives to capture the full spectrum of patient-clinician interactions. A notable example is the development of a longitudinal, multimodal recording system designed to document real-world clinical encounters. This innovative system integrates 360-degree video and audio recordings with patient surveys and EHR data, aiming to create a rich dataset that mirrors the complexity of actual healthcare interactions.
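The protocol does not publish a storage schema, but the core idea, one encounter linked across modalities by a shared key, can be sketched as a simple record type. Every field name below is an illustrative assumption, not the study's actual data format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EncounterRecord:
    """One clinical encounter linked across modalities.

    A minimal sketch: the published protocol does not specify a schema,
    so every field name here is a hypothetical illustration.
    """
    encounter_id: str          # shared key joining all modalities
    video_path: str            # 360-degree video recording of the visit
    audio_path: str            # audio track of the consultation
    survey_id: Optional[str]   # post-visit patient survey, if completed
    ehr_patient_id: str        # link to demographic and clinical EHR data
```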
Study Design and Implementation
Conducted at an academic outpatient endocrinology clinic, the study involved adult patients attending in-person visits. With the consent of both clinicians and patients, each encounter was recorded using a 360-degree video camera, capturing the entirety of the consultation. Following the visit, patients completed surveys assessing aspects such as empathy, satisfaction, pace, and treatment burden. Additionally, demographic and clinical data were extracted from the EHRs to provide a comprehensive context for each interaction.
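To make the post-visit workflow concrete, the sketch below joins the three data sources on a shared encounter key. The table layout and key names are hypothetical; the protocol does not specify how records are stored or merged.

```python
def link_encounter(encounter_id: str,
                   recordings: dict,  # encounter_id -> recording file paths
                   surveys: dict,     # encounter_id -> post-visit survey responses
                   ehr: dict) -> dict:  # encounter_id -> demographic/clinical data
    """Assemble one multimodal record, tolerating missing pieces.

    Hypothetical sketch: a failed recording or an unreturned survey
    leaves that slot as None rather than dropping the encounter.
    """
    return {
        "encounter_id": encounter_id,
        "recording": recordings.get(encounter_id),
        "survey": surveys.get(encounter_id),
        "ehr": ehr.get(encounter_id),
    }
```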
Feasibility and Early Findings
The study’s feasibility was evaluated based on several key endpoints: clinician consent, patient consent, recording success, survey completion, and data linkage across modalities. By August 2025, the study had achieved significant milestones: 97% of eligible clinicians (35 out of 36) and 75% of approached patients (212 out of 281) had consented to participate. Of the consented encounters, 76% (162 out of 213) had complete recordings, and 96% (204 out of 213) of patients completed the follow-up surveys. These promising results underscore the feasibility of capturing the multimodal dynamics of patient-clinician encounters.
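The reported percentages follow directly from the counts; a few lines of Python reproduce them as a sanity check:

```python
# Feasibility endpoints reported by August 2025 (numerator, denominator).
endpoints = {
    "clinician consent":   (35, 36),
    "patient consent":     (212, 281),
    "complete recordings": (162, 213),
    "survey completion":   (204, 213),
}

for name, (k, n) in endpoints.items():
    print(f"{name}: {k}/{n} = {k / n:.0%}")
# clinician consent: 35/36 = 97%
# patient consent: 212/281 = 75%
# complete recordings: 162/213 = 76%
# survey completion: 204/213 = 96%
```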
Implications for AI in Healthcare
The integration of diverse data sources—video, audio, surveys, and EHRs—provides a multifaceted view of the clinical encounter. This approach addresses the limitations of existing AI models that often rely solely on EHRs, which may not fully capture the interpersonal aspects of patient care. By incorporating real-world interactions, AI systems can be trained to recognize and interpret the subtleties of patient-clinician communication, leading to more empathetic and effective healthcare solutions.
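As a sketch of how such a dataset could supervise model training (the protocol paper does not prescribe a pipeline, and the transcription step and field names below are assumptions), each fully linked encounter can pair what was said in the visit with how the patient reported experiencing it:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    """Hypothetical supervised example pairing dialogue with patient-reported outcomes."""
    transcript: str     # speech-to-text of the visit audio (ASR assumed upstream)
    empathy_label: int  # patient-reported empathy rating from the post-visit survey
    context: dict       # selected EHR fields for conditioning, e.g., visit type

def build_examples(linked_records: list[dict]) -> list[TrainingExample]:
    """Keep only encounters with both a transcript and a survey; others cannot supervise."""
    examples = []
    for rec in linked_records:
        if rec.get("transcript") and rec.get("survey"):
            examples.append(TrainingExample(
                transcript=rec["transcript"],
                empathy_label=rec["survey"]["empathy"],
                context=rec.get("ehr") or {},
            ))
    return examples
```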
Challenges and Considerations
While the study demonstrates the feasibility of this approach, several challenges remain. Ensuring patient privacy and obtaining informed consent are paramount, especially when recording sensitive health information. Additionally, the integration of multimodal data requires sophisticated analytical tools to process and interpret the complex interactions captured. Future research will need to address these challenges to refine the system and expand its applicability across different medical specialties.
Conclusion
The development of a longitudinal, multimodal recording system marks a significant advancement in AI research within the healthcare sector. By capturing the richness of real-world patient-clinician interactions, this approach lays the groundwork for AI models that are more attuned to the human elements of care. As the study progresses, it holds the potential to transform how AI is integrated into healthcare, ensuring that technology serves to enhance, rather than replace, the human touch in medicine.
References
- Al Zahidy, M., Guevara Maldonado, K., Andrango, L. V., Proano, A. C., Claros, A. G., Jimenez, M. L., Toro-Tobon, D., Ponce-Ponce, O. J., & Brito, J. P. (2025). Longitudinal and Multimodal Recording System to Capture Real-World Patient-Clinician Conversations for AI and Encounter Research: Protocol. arXiv preprint.
- Johnson, K. B., Alasaly, B., Jang, K. J., Eaton, E., Mopidevi, S., Koppel, R., & the AI-4-AI Lab. (2025). Observer: Creation of a Novel Multimodal Dataset for Outpatient Care Research. Journal of the American Medical Informatics Association.
- Addlesee, A., Cherakara, N., Nelson, N., Hernandez Garcia, D., Gunson, N., Sieińska, W., Dondrup, C., & Lemon, O. (2024). Multi-party Multimodal Conversations Between Patients, Their Companions, and a Social Robot in a Hospital Memory Clinic. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Masayoshi, K., Hashimoto, M., Yokoyama, R., Toda, N., Uwamino, Y., Fukuda, S., Namkoong, H., & Jinzaki, M. (2025). EHR-MCP: Real-world Evaluation of Clinical Information Retrieval by Large Language Models via Model Context Protocol. arXiv preprint.
- Wang, E. H., & Wen, C. X. (2025). When Neural Implant Meets Multimodal LLM: A Dual-Loop System for Neuromodulation and Naturalistic Neural-Behavioral Research. arXiv preprint.
