Generative AI in Nursing Documentation: Literature Review
Authored by: Milena Mardahay, BSN, RN, CGRN, ANA\California Advocacy Institute Fellow 2025
Abstract
The integration of generative artificial intelligence (AI) into nursing documentation systems represents one of the most significant informatics developments in modern clinical practice. From 2023 through mid-2025, research across academic, government, and healthcare settings has explored the use of large language models (LLMs)—notably GPT-3.5, GPT-4, Claude, Gemini, and domain-specific models—in tasks ranging from charting assistance to autonomous ambient documentation and triage decision support.
This review systematically synthesizes empirical findings from 37 studies, focusing on the measurable impacts of AI on documentation efficiency, accuracy, user satisfaction, and clinical integration. Key themes include model evolution and performance variability, usability across inpatient and ambulatory settings, implementation risks, and ethical considerations. While newer models like GPT-4 and Claude 3 demonstrate marked improvements in output quality and hallucination reduction, challenges remain in EHR integration, scope-of-practice governance, and equitable deployment.
This review offers a comparative, model-specific analysis of generative AI performance in nursing workflows and provides policy-aligned recommendations for responsible implementation.
Keywords: generative AI, large language models, nursing documentation, GPT-4, ambient scribing, charting efficiency, health informatics

1. Introduction
Nursing documentation is a core component of clinical practice and professional accountability, yet it remains one of the most time-consuming and error-prone elements of the nursing workflow. Over the past two decades, successive waves of health IT have attempted to reduce this burden—from early electronic health records (EHRs) to structured templates and voice-to-text tools—often with mixed results. The recent emergence of generative artificial intelligence (AI), particularly large language models (LLMs), introduces a new paradigm in documentation: models capable of drafting, refining, and integrating narrative clinical content with minimal manual input.
In contrast to narrow AI applications, generative models can produce coherent, contextually relevant documentation from ambient audio (Falcetta et al., 2023), structured prompts (Zaretsky et al., 2024), or clinical note inputs (Biswas & Talukdar, 2024). The promise of such tools includes reduced charting time, improved documentation quality, and lower cognitive burden for clinicians—all of which could free nurses to return their focus to patient care. Simultaneously, these tools raise pressing concerns about transparency, error propagation, scope-of-practice conflicts, and regulatory readiness (Reddy, 2024; Pomeroy & Yang, 2023).
This literature review addresses the following research question:
What are the measurable impacts of generative AI integration on charting time, documentation quality, and clinician satisfaction in nursing workflows, and how can governance frameworks address emerging ethical, privacy, and autonomy concerns?
The review focuses on empirical evidence published between January 2023 and August 2025, restricted to studies with explicit nursing relevance. The synthesis emphasizes how model performance, risks, and workflow suitability have evolved across model generations, including GPT-3.5, GPT-4/4o, Claude 2/3, Gemini 1.5, Med-PaLM 2, and open-weight models such as LLaMA and Mistral. The goal is not only to evaluate whether generative AI improves nursing documentation, but also to understand under what conditions, with which models, and at what potential cost to practice integrity and patient safety.
2. Search Strategy and Study Selection
This review synthesizes literature published between January 2023 and August 2025 on the integration of generative artificial intelligence (AI) into nursing documentation and clinical workflows. A total of 37 empirical studies were ultimately included, identified through manual searches of academic databases (Google Scholar, Scopus, PubMed, and CINAHL) and through an evidence export generated with the Elicit synthesis tool.
2.1 Databases and Sources
Studies were drawn from peer-reviewed journals and preprint servers indexed in PubMed, JMIR, PLOS ONE, Frontiers, and ArXiv, among others. Where duplicates or preprint–final publication pairs existed, the most complete or peer-reviewed version was retained.
2.2 Inclusion Criteria
Studies were included if they:
Were published between January 2023 and August 2025;
Focused on the use of generative AI or large language models (LLMs) in nursing documentation or interprofessional charting with nursing relevance;
Reported empirical findings (quantitative, qualitative, or mixed methods);
Described the model used (e.g., GPT-4, Claude, Med-PaLM 2), the clinical setting, and at least one outcome related to documentation, accuracy, satisfaction, efficiency, safety, or workflow integration;
Were written in English.
2.3 Exclusion Criteria
Studies were excluded if they:
Presented purely theoretical or opinion-based content without original data;
Focused exclusively on physician workflows without implications for nursing practice;
Reported only technical performance benchmarks without a clinical deployment or simulation component;
Were duplicate records of preprint and journal versions.
2.4 Screening Process
All records were initially screened at the title and abstract level for relevance to nursing documentation and generative AI. Full texts were then reviewed for eligibility and extraction by a single reviewer, using pre-defined criteria based on relevance to clinical documentation, empirical design quality, and model transparency. Risk of selection bias was mitigated by adherence to PRISMA-aligned screening logic.
2.5 PRISMA Flow Summary
Records identified from database searches and Elicit export: 141
Duplicates or incomplete records removed: 14
Records screened by title and abstract: 127
Full-text articles assessed for eligibility: 40+
Studies included in synthesis: 37
(Studies identified via the Elicit spreadsheet but lacking direct PDF access were verified through DOI and included when eligible.)
3. Data Extraction and Quality Appraisal
Data were systematically extracted from 37 empirical studies investigating generative AI integration into nursing documentation and clinical workflows. Extraction categories included study design, setting, sample size, clinical domain, AI model name/version, prompt method or integration style (e.g., ambient scribe, structured input), comparator groups if applicable, primary and secondary outcomes, statistical metrics, limitations, and deployment stage (e.g., sandbox, pilot, production). Where available, effect sizes, confidence intervals, and thematic categories from qualitative analyses were recorded.
3.1 Extraction Process
Structured data extraction was conducted by a single reviewer using a matrix-based tool in Microsoft Excel and Python. This matrix integrated data from full-text PDFs and the structured Elicit output. When discrepancies between PDF and spreadsheet values were identified, the full-text version was prioritized. Studies without explicit reporting of sample size or model version were annotated as "not reported."
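The extraction matrix itself is not published; the following is a minimal sketch of the reconciliation logic described above, with all file and column names hypothetical rather than taken from the actual tooling.

```python
import pandas as pd

# Hypothetical file and column names; illustrates the reconciliation rule
# described above, not the reviewer's actual extraction tool.
pdf_df = pd.read_excel("fulltext_extraction.xlsx")   # values read from full-text PDFs
elicit_df = pd.read_csv("elicit_export.csv")         # structured Elicit output

merged = pdf_df.merge(
    elicit_df, on="study_id", how="outer", suffixes=("_pdf", "_elicit")
)

# When PDF and spreadsheet values disagree, the full-text (PDF) value wins;
# fall back to the Elicit value, then annotate as "not reported".
for col in ("sample_size", "model_version"):
    merged[col] = (
        merged[f"{col}_pdf"]
        .fillna(merged[f"{col}_elicit"])
        .fillna("not reported")
    )
```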
3.2 Summary of Study Designs and Methods
Of the 37 studies:
11 were prospective observational or cohort designs;
8 were randomized controlled or quasi-experimental trials;
6 were retrospective chart reviews;
5 were simulation-based evaluations;
7 used mixed-methods approaches incorporating survey data or qualitative thematic analysis.
Clinical domains represented included emergency medicine (6), hospice/palliative (2), perioperative (3), inpatient medicine (10), ambulatory care (5), and education/training (4); the remaining seven studies spanned other or mixed domains. Sample sizes ranged from N=6 (pilot usability studies) to N>10,000 (retrospective EHR analyses).
3.3 AI Models and Integration Modes
Among the LLMs reported:
GPT-3.5 was used in 8 studies;
GPT-4 in 13 studies;
Claude 2 or 3 in 4 studies;
Med-PaLM 2 in 2 studies;
Gemini 1.5 in 1 study;
Open-weight models (LLaMA 2/3, Mistral, local RAG agents) in 4 studies;
Model not specified in 5 studies.
Integration methods varied by model maturity. Ambient scribe tools (e.g., Abridge) dominated GPT-4 deployments, while GPT-3.5 and Claude models were often tested using structured inputs. Prompt engineering and retrieval-augmented generation (RAG) were explicitly described in 11 studies.
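None of the reviewed studies publish their pipelines, so the following is a minimal sketch of the generic retrieval-augmented pattern they describe; the toy keyword retriever and all names are illustrative assumptions, standing in for the embedding search real systems use.

```python
# Generic retrieval-augmented generation (RAG) pattern: ground the model in
# retrieved reference text rather than its parametric memory.

def retrieve(query: str, corpus: dict[str, str], k: int = 3) -> list[str]:
    """Toy retriever: rank locally stored guideline snippets by keyword overlap."""
    terms = set(query.lower().split())
    scored = sorted(
        corpus.values(),
        key=lambda text: len(terms & set(text.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(patient_summary: str, snippets: list[str]) -> str:
    """Constrain the draft to retrieved context to reduce hallucination risk."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Draft a nursing progress note using ONLY the context below.\n"
        "If required information is absent, write 'not documented'.\n\n"
        f"Context:\n{context}\n\nPatient summary: {patient_summary}"
    )
```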
3.4 Quality Appraisal Framework
The methodological quality of each study was assessed using the appropriate CASP tool based on design type (e.g., CASP RCT, CASP Cohort, CASP Qualitative). Most studies exhibited moderate-to-high quality with common limitations including:
Small, single-site samples (n=18);
Simulated cases lacking real-time clinical deployment (n=9);
Lack of statistical testing (n=7);
Unclear model provenance or prompt transparency (n=6).
Risk of bias was assessed qualitatively and flagged where evaluators were not blinded, outcomes were self-reported, or industry sponsorship was noted.
3.5 Table of Study Characteristics
See Table 1 for each study's year, country, setting, design, sample size (N), AI model/version, integration method, outcomes, key findings, and limitations.
4. Thematic Synthesis
This section synthesizes findings from 37 empirical studies into seven evidence-based themes reflecting generative AI’s influence on nursing documentation, clinical workflows, and decision-making between 2023 and 2025. Each theme includes summary outcomes, model versions, and longitudinal progression of performance from GPT-3.5 to GPT-4/4o, Claude 2/3, Med-PaLM 2, Gemini, and LLaMA-class models.
4.1 Documentation Efficiency and Quality
Across studies, documentation efficiency was the most commonly evaluated endpoint (n=22). Charting time was reduced by 20–40% in real-world deployments of ambient AI scribes using GPT-4-class models (Biswas & Talukdar, 2024; Ju et al., 2025). GPT-3.5 studies showed modest improvements (10–18%) in structured-input simulations but struggled with note coherence and accuracy without human edits (Kirpalani et al., 2024).
Abridge’s ambient documentation tool, powered by GPT-4 and deployed at Stanford and Tampa General, reduced nurse charting time in medication reconciliation and assessment fields, though quantitative error reduction metrics were not disclosed (Pomeroy & Yang, 2023). Claude 2 and Gemini 1.5 performed comparably to GPT-4 in documentation completeness but underperformed in grammar and medication reconciliation accuracy (Alkhalaf et al., 2024).
4.2 Diagnostic and Assessment Support
Ten studies evaluated LLM-assisted diagnostic triage or case assessment. GPT-4 models outperformed both GPT-3.5 and clinicians in simulated ED case reviews, with diagnostic accuracy improvements of 12–18% (Hoppe et al., 2024; Mahdi et al., 2025). GPT-3.5 underperformed, especially in cases involving rare presentations or ambiguous findings (Kirpalani et al., 2024).
Claude 3’s long-context input improved continuity of assessments across multi-encounter cases, but its over-cautiousness reduced specificity in hospice triage tools (Prestia, 2024). Med-PaLM 2 scored high in structured QA benchmarks but was not field-tested in clinical settings within the review period.
4.3 Patient Safety and Risk
Eleven studies discussed safety or harm-related outcomes. Hallucination risk was model-dependent and inversely correlated with model sophistication: GPT-3.5 generated factual inaccuracies in 22–36% of outputs (Draschl et al., 2023; Zichen et al., 2024), while GPT-4 reduced this to 8–12% in production use (Biswas & Talukdar, 2024).
Studies deploying ambient AI reported few adverse events but emphasized the need for human-in-the-loop governance, especially in settings involving medication orders, code status, or high-acuity triage (Clough et al., 2024). Open-weight models posed higher hallucination and bias risk unless fine-tuned and externally validated.
4.4 Workflow Integration and Usability
Integration maturity varied widely. GPT-4-enabled ambient scribing (e.g., Abridge) and Epic SmartText integrations were positively rated for usability and task offloading (Chen et al., 2024). However, some studies noted workflow misalignment where ambient tools captured excessive irrelevant detail, requiring nurse post-editing (Herrera-Sánchez & Moreira-Flores, 2023).
GPT-3.5 and Claude models required structured prompting, which increased cognitive load for clinicians. Studies emphasized the importance of prompt standardization and system training (Prestia, 2024).
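As an illustration of the kind of prompt standardization these studies call for, here is a minimal template sketch; the fields and constraints are assumptions for illustration, not drawn from any reviewed protocol.

```python
# Illustrative standardized prompt template; field names and constraints are
# hypothetical, not taken from any reviewed deployment.
NOTE_TEMPLATE = """You are assisting a registered nurse with documentation.
Task: draft a {note_type} note in SBAR format.
Patient context: {context}
Constraints:
- Use only facts stated in the patient context.
- Mark any missing required element as [NOT DOCUMENTED].
- Do not generate diagnoses or orders; those require clinician entry."""

prompt = NOTE_TEMPLATE.format(
    note_type="shift assessment",
    context="72F, post-op day 1, total knee arthroplasty; pain 4/10; vitals stable.",
)
```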
4.5 Equity, Bias, and Generalizability
Bias and generalizability were noted in 12 studies. LLMs trained on English-dominant corpora struggled with multilingual input, regional phrasing, and culturally embedded references. Claude 3 showed improved neutrality in patient assessments compared to GPT-3.5 but was not evaluated in racially diverse environments (Omon et al., 2025).
Open-weight models had the highest generalization variance, particularly when deployed without domain-specific fine-tuning. Studies called for standardized demographic audits and cross-site replication.
4.6 Regulatory, Privacy, and Governance
Data privacy, auditability, and institutional approval processes were barriers to LLM deployment in clinical environments (Feng et al., 2022; Clough et al., 2024). Most GPT-4 deployments operated in sandbox or hybrid environments with strict PHI tokenization and audit trails.
Studies deploying open-weight models (LLaMA, Mistral) emphasized local control and on-prem inference as strengths for HIPAA compliance, though tradeoffs in performance and ease of deployment were noted.
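A minimal sketch of the PHI-tokenization idea follows, assuming regex-detectable identifiers; production deployments use validated de-identification engines, so treat the patterns and names here as placeholders.

```python
import re
import uuid

# Toy PHI tokenization: swap identifiers for opaque tokens before text leaves
# the institution, keeping the re-identification map local. Patterns are
# illustrative placeholders, not a validated de-identification method.
PHI_PATTERNS = {
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b"),
    "DOB": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}

def tokenize_phi(text: str) -> tuple[str, dict[str, str]]:
    vault: dict[str, str] = {}  # token -> original value; never leaves the site
    for label, pattern in PHI_PATTERNS.items():
        for match in pattern.findall(text):
            token = f"[{label}-{uuid.uuid4().hex[:8]}]"
            vault[token] = match
            text = text.replace(match, token)
    return text, vault

def detokenize(text: str, vault: dict[str, str]) -> str:
    """Restore original identifiers after the model's draft returns."""
    for token, original in vault.items():
        text = text.replace(token, original)
    return text
```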
4.7 Economics and Time-Motion Efficiency
Nine studies estimated financial or staffing benefits of AI integration. Time-motion analyses from Ju et al. (2025) and Chen et al. (2024) estimated 30–40% savings in documentation time per nurse per shift, translating to indirect cost savings or patient volume expansion. However, no RCTs directly linked AI deployment to improved staffing ratios, billing accuracy, or patient throughput.
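To make those percentages concrete, a back-of-the-envelope calculation follows; the 180-minute baseline is an assumed figure for illustration, not a value reported in the reviewed studies.

```python
# Illustrative arithmetic only; the baseline is an assumption.
baseline_min_per_shift = 180  # assumed documentation time in a 12-hour shift

for savings in (0.30, 0.40):  # range reported by Ju et al. (2025) and Chen et al. (2024)
    saved = baseline_min_per_shift * savings
    print(f"{savings:.0%} reduction -> {saved:.0f} min returned per nurse per shift")
```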
Several simulation studies emphasized return on investment (ROI) potential, but external funding and industry sponsorship limited the interpretability of financial claims (Bracken et al., 2025).
These themes provide the foundation for model-by-model comparison (Section 6) and implications for future deployment, regulation, and nursing governance (Section 9).
5. Methodology of This Review
This review follows a scoping review framework, appropriate for synthesizing an emerging and interdisciplinary evidence base across varied methodologies, settings, and model architectures. Given the heterogeneity of included studies—spanning observational, quasi-experimental, technical evaluations, and implementation case studies—a narrative synthesis was used rather than meta-analysis.
5.1 Review Question and Framing
The primary review question was:
What are the measurable impacts of generative AI (2023–2025) on nursing documentation, clinical decision-making, and workflow efficiency?
This was further structured using a PECOT format:
P (Population): Nurses, nursing informatics professionals, interdisciplinary clinical staff
E (Exposure): Generative AI and LLMs (e.g., GPT-3.5, GPT-4, Claude, Gemini, Med-PaLM 2)
C (Comparator): Pre-AI workflows or clinician-only documentation
O (Outcomes): Charting time, note accuracy, diagnostic support, usability, safety, equity, and cost
T (Timeframe): January 2023 to August 2025
5.2 Synthesis Approach
Given the high variance in study design, outcome measures, and model types, a narrative thematic synthesis was conducted. Thematic categories were iteratively developed from repeated reading of the data extraction tables and aligned with prior scoping reviews in clinical AI (Park et al., 2024). Cross-study patterns and discrepancies were noted, with particular emphasis on model evolution (GPT-3.5 → GPT-4/4o), setting (inpatient vs. outpatient), and deployment mode (sandbox vs. production).
5.3 Review Limitations
Data heterogeneity: Studies varied widely in sample size, endpoints, and reporting quality, limiting comparative synthesis.
Non-randomized designs: Most studies were pilot or retrospective in nature; only a few quasi-experimental designs or RCTs were identified.
Language and publication bias: Only English-language sources were included; grey literature was limited to accessible preprints.
Single-reviewer extraction: Data extraction and appraisal were conducted by one reviewer, which may introduce subjectivity despite use of structured extraction templates.
Temporal limitation: As of mid-2025, long-term outcomes of AI deployment in nursing remain underreported.
Despite these constraints, this scoping review provides the most comprehensive synthesis to date of generative AI's measurable effects on nursing documentation and clinical practice across real-world and simulated environments.
6. Comparative Analysis by Model (2023–2025)
This section contrasts the performance, reliability, and clinical relevance of key large language model (LLM) families deployed in nursing documentation and clinical workflows between 2023 and 2025. Analysis focuses on GPT-3.5, GPT-4/4o, Claude 2/3, Gemini 1.5, Med-PaLM 2, and open-weight models like LLaMA and Mistral. Each model class is reviewed for usability, hallucination rate, deployment feasibility, and integration into healthcare systems.
6.1 GPT-3.5
GPT-3.5 served as the baseline model in many early studies (2023–early 2024). It showed modest improvements in documentation speed (10–18%) and diagnostic support but struggled with hallucination control and consistency in clinical terminology (Kirpalani et al., 2024; Draschl et al., 2023). Structured prompting was necessary, increasing clinician cognitive load. Simulation studies found 22–36% of outputs contained factual errors or inconsistencies (Zichen et al., 2024).
6.2 GPT-4 and GPT-4o
GPT-4-class models consistently outperformed predecessors in accuracy, safety, and ambient scribing utility. Real-world deployments (e.g., Stanford, Tampa General) using ambient tools like Abridge demonstrated 20–40% charting time reductions and improved note completeness (Pomeroy & Yang, 2023; Ju et al., 2025). GPT-4 hallucination rates dropped to 8–12% in production environments. GPT-4o introduced multimodal reasoning and faster inference but was not yet extensively field-tested in nursing settings by mid-2025.
6.3 Claude 2 and Claude 3
Claude 2 demonstrated high harmlessness and low hallucination rates in QA benchmarks, with Claude 3 improving long-context processing useful for multi-encounter continuity. Claude 3 was favored for hospice triage accuracy but rated lower in medication field documentation and clinical grammar precision compared to GPT-4 (Prestia, 2024). Prompt alignment and structured inputs remained necessary.
6.4 Gemini 1.5
Gemini 1.5 showed strong multimodal capacity and context handling in synthetic chart summarization but lagged in deployment maturity. Alkhalaf et al. (2024) found it comparable to GPT-4 in content extraction but slightly weaker in medication reconciliation. Limited evidence exists for Gemini’s clinical integration, and concerns about Google Cloud HIPAA compliance slowed adoption.
6.5 Med-PaLM 2
Med-PaLM 2 performed strongly on clinical QA benchmarks and reasoning tests but lacked bedside evaluation. While the model demonstrated superior calibration and safety in controlled tests (Mahdi et al., 2025), no peer-reviewed deployment studies involving nurses were found in this review period.
6.6 LLaMA, Mistral, and Other Open-Weight Models
Open-weight models like LLaMA 2/3 and Mistral offered on-premises control for institutions concerned with data governance. However, they underperformed out-of-the-box in diagnostic precision and charting accuracy compared to fine-tuned proprietary models. Feng et al. (2022) warned of high hallucination and bias risk without extensive local training. These models showed promise when domain-tuned but required substantial infrastructure.
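A minimal sketch of the on-prem pattern follows, assuming a locally hosted open-weight model served behind an OpenAI-compatible endpoint (as vLLM or llama.cpp servers provide); the URL and model name are placeholders.

```python
from openai import OpenAI

# On-prem inference sketch: the model runs inside the institutional network,
# so no PHI crosses to an external API. Endpoint URL and model name are
# placeholders for a locally served open-weight model.
client = OpenAI(base_url="http://llm.internal.example/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama-3-8b-instruct",   # hypothetical locally hosted model
    messages=[{"role": "user",
               "content": "Draft a skeleton for a nursing shift-assessment note."}],
    temperature=0.2,               # low temperature for more consistent drafts
)
print(response.choices[0].message.content)
```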
6.7 Timeline Summary
Across the review window, capability and deployment maturity progressed from GPT-3.5 structured-prompt baselines (2023) through GPT-4-class ambient deployments and long-context Claude 3 use (2024–2025), with open-weight models maturing as on-premises alternatives. These findings establish the platform-specific trade-offs that inform governance, procurement, and workforce training decisions addressed in later sections.
7. Major Findings and Outcomes
The synthesis of 37 studies between 2023 and 2025 reveals a consistent pattern: generative AI tools, particularly those powered by GPT-4-class and Claude 3 models, enhance nursing documentation efficiency, diagnostic support, and user satisfaction—though results vary by clinical setting, workflow integration, and model maturity.
7.1 Documentation Efficiency and Accuracy
Across all models, the most replicated benefit was charting time reduction. Studies deploying GPT-4-based ambient scribes (e.g., Abridge) reported median reductions of 30–40% per shift (Ju et al., 2025; Chen et al., 2024). In simulated or structured input studies using GPT-3.5 or Claude 2, savings were more modest (10–20%) and contingent on structured prompts (Kirpalani et al., 2024).
Note quality and completeness improved significantly with GPT-4 and Claude 3, particularly in medication reconciliation, narrative clarity, and standardized terminology. However, ambient systems sometimes over-captured irrelevant details, requiring nurse post-editing (Herrera-Sánchez & Moreira-Flores, 2023).
7.2 Clinical Reasoning and Diagnostic Support
GPT-4 and Claude 3 models consistently outperformed clinicians in structured case evaluations—particularly in emergency and triage scenarios (Hoppe et al., 2024). Accuracy improvements ranged from 12% to 18% above baseline for GPT-4 (Mahdi et al., 2025). However, these models still lagged in rare condition detection and were sensitive to ambiguous prompts.
7.3 Safety and Hallucination Reduction
Factual error rates declined substantially from GPT-3.5 to GPT-4: 22–36% hallucination in early models dropped to 8–12% in sandboxed GPT-4 systems (Draschl et al., 2023; Biswas & Talukdar, 2024). Claude 3 exhibited similar safety performance but lacked deployment scale. No serious adverse events were reported, though most studies emphasized human-in-the-loop safeguards.
7.4 User Satisfaction and Usability
Clinicians rated ambient GPT-4 tools and SmartText integrations favorably, noting ease of use, time savings, and improved workflow continuity (Pomeroy & Yang, 2023; Chen et al., 2024). Usability scores declined in studies requiring manual prompt design or frequent output correction (Prestia, 2024).
7.5 Variability by Setting and Workflow
The magnitude of benefits varied by environment:
Inpatient/acute care: Ambient tools reduced documentation burden significantly.
Outpatient and hospice: Prompted models were feasible but limited by input complexity.
Simulation studies: Demonstrated theoretical potential, but lacked deployment realism.
7.6 Adverse Outcomes and Mitigation
No studies reported direct harm from generative AI outputs. However, several highlighted near-miss documentation errors, particularly in ambiguous triage cases or when post-editing was bypassed. Risk was highest in GPT-3.5 outputs and open-weight models lacking domain-specific fine-tuning.
These findings support the growing consensus that generative AI can enhance nursing documentation and decision-making—but only under specific governance, integration, and supervision conditions addressed in later sections.
8. Limitations (Evidence Base and Models)
Despite promising outcomes, the evidence base and underlying LLMs present several limitations that constrain generalizability, safety, and integration feasibility across clinical settings. This section details the methodological and technical limitations observed from 2023 to 2025.
8.1 Evidence-Based Limitations
Sample Size and Generalizability: Most studies featured small or moderate sample sizes, often limited to single sites or simulation environments. This reduces external validity, particularly for rural, high-acuity, or non-academic settings (Zichen et al., 2024; Prestia, 2024).
Design Heterogeneity: Study designs varied widely—ranging from retrospective chart reviews and simulation tasks to prospective pilot deployments—making meta-analysis or cross-comparison difficult. Standardized outcome definitions were rare.
Metric Variability: Studies used diverse metrics to assess efficiency (e.g., documentation time per encounter, keystrokes, post-editing rate) and accuracy (e.g., ICD match rate, narrative coherence). This heterogeneity limits quantitative synthesis and benchmarking.
Simulation Bias: Many early studies (e.g., GPT-3.5 benchmarks) were conducted in artificial or synthetic settings, reducing ecological validity. Results often overestimated performance compared to real-world EHR deployments (Kirpalani et al., 2024).
Publication Bias and Industry Sponsorship: Several studies were affiliated with vendors or sponsored by technology firms (e.g., Bracken et al., 2025), raising potential conflicts of interest. Null findings or negative outcomes were underreported.
8.2 Model-Specific Limitations
Hallucination and Prompt Sensitivity: While hallucination rates decreased with GPT-4 and Claude 3, outputs remained prompt-sensitive. Slight variations in wording yielded divergent results, especially in clinical interpretation tasks (Draschl et al., 2023).
Context Window and Fragmentation: GPT-3.5 and early Claude models struggled with long-note continuity. Claude 3 improved context retention, but fragmentation persisted in multiday or multi-encounter documentation.
Data Leakage and Privacy Risks: Proprietary models trained on public web data risk exposure of outdated or non-compliant content. Studies emphasized the importance of PHI tokenization, audit trails, and sandboxed deployment (Clough et al., 2024; Feng et al., 2022).
Model Drift and Non-Determinism: Models updated over time without transparent versioning (e.g., OpenAI) created challenges for reproducibility and clinical validation. Outputs varied with minor interface or backend changes.
Bias and Linguistic Limitations: LLMs showed differential performance across languages, dialects, and cultural references. No reviewed model was comprehensively evaluated in racially diverse, multilingual, or low-resource clinical contexts.
8.3 Evolution of Limitations Over Time
Between 2023 and 2025, several limitations improved (e.g., hallucination, documentation coherence), while others remained unresolved or became more complex:
Improved: Hallucination rates (↓), prompt alignment (↑), note quality (↑).
Persistent: Reproducibility challenges, linguistic bias, lack of transparency in model updates.
Emerging: Governance burden, over-dependence on proprietary APIs, unclear liability in hybrid human-AI documentation workflows.
These limitations contextualize the major findings in Section 7 and directly inform the governance and practice implications in Section 9.
9. Implications for Nursing Practice, Education, and Policy
The integration of generative AI into nursing workflows presents substantial opportunities to reduce documentation burden, improve diagnostic precision, and streamline care delivery. However, as evidenced by Sections 4–8, these gains are conditional on rigorous oversight, tailored deployment, and ethical governance. This section outlines practical and policy-relevant implications for frontline nursing, educational curricula, and institutional governance.
9.1 Nursing Practice and Workforce Design
Generative AI shifts the nature of nursing documentation from manual data entry to supervision of machine-generated content. This reframing demands new competencies in reviewing, validating, and curating AI-assisted notes. Nurse staffing models may need to adapt, allocating time for post-editing and oversight rather than raw charting.
Clinical roles will stratify: bedside nurses may rely on ambient AI tools for initial drafts, while informatics-trained nurses or dedicated reviewers provide quality assurance. Human-in-the-loop validation remains essential, especially in high-acuity units, hospice, and perioperative settings, where LLMs may hallucinate or misclassify critical details.
9.2 Education and Competency Frameworks
Nursing curricula must evolve to include digital literacy, prompt engineering, AI supervision, and ethical evaluation. As generative tools become embedded in EHRs (e.g., Epic SmartText), all nurses—not just informatics specialists—will require training on how to critically appraise AI outputs.
Certification bodies and accrediting agencies should integrate AI competencies into continuing education frameworks. Simulated chart review with AI-generated notes may provide a safe training environment to build these skills.
9.3 Institutional Governance and Procurement
Hospitals and health systems adopting generative AI tools must establish governance structures that ensure safety, accountability, and equity. This includes:
Vetting model provenance and bias testing prior to deployment.
Formal audit trails for all AI-generated documentation (a minimal record sketch follows this list).
Clear assignment of liability for errors in hybrid human-AI documentation workflows.
Ongoing performance monitoring, including hallucination tracking and user feedback integration.
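To make the audit-trail item above concrete, here is a minimal sketch of one possible record structure; every field name is hypothetical rather than drawn from any reviewed system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib

def sha256(text: str) -> str:
    """Hash prompt/draft text so the log itself carries no PHI."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical audit record for one AI-assisted documentation event.
@dataclass
class AIDocAuditRecord:
    note_id: str
    model_name: str            # vendor tool, with a pinned version for reproducibility
    model_version: str
    prompt_hash: str           # hashes rather than raw text limit PHI in logs
    draft_hash: str
    reviewing_nurse_id: str
    edits_made: bool           # whether the nurse modified the AI draft
    signed_off_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

record = AIDocAuditRecord(
    note_id="note-0001",
    model_name="ambient-scribe",
    model_version="2025-06-pinned",
    prompt_hash=sha256("...prompt text..."),
    draft_hash=sha256("...draft note text..."),
    reviewing_nurse_id="rn-4821",
    edits_made=True,
)
```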
Procurement decisions should weigh cost-efficiency against transparency, adaptability, and on-premise deployment options where privacy constraints demand local control.
9.4 Policy Recommendations
State boards of nursing and national policy entities (e.g., ANA, CMS) should:
Issue guidance on AI-assisted documentation standards and acceptable use.
Support funding for multi-site trials of generative AI tools in diverse settings.
Require bias audits and demographic performance reporting as part of regulatory approval.
Establish a central repository of validated prompts and LLM workflows for nursing documentation.
9.5 Research Agenda
Future work should prioritize:
Randomized controlled trials comparing LLM-assisted vs. traditional documentation across units.
Cost-effectiveness studies incorporating staffing impact, patient throughput, and billing accuracy.
Equity audits evaluating model performance across racial, linguistic, and rural/urban subgroups.
Frameworks for AI competency evaluation in pre-licensure and continuing education.
Generative AI is no longer merely a technological add-on; it is now a catalyst for transforming nursing documentation. Effective governance, robust training, and thoughtful integration will determine whether these tools augment clinical judgment or introduce new risks to safety and equity.
10. Conclusion
Between 2023 and 2025, generative artificial intelligence (AI)—especially large language models (LLMs) such as GPT-3.5, GPT-4, Claude 3, and Gemini 1.5—transitioned from early-stage experimentation to targeted clinical deployment within nursing documentation. The reviewed literature consistently demonstrates that generative AI can reduce documentation time, improve the accuracy and completeness of clinical notes, and enhance the efficiency of nursing workflows when deployed with appropriate safeguards and supervision.
Across 37 studies, ambient AI tools using GPT-4-class models achieved reductions in charting time ranging from 20% to 40%, with gains most prominent in medication reconciliation, assessment documentation, and progress note drafting. Diagnostic support studies showed that GPT-4 models could exceed clinician accuracy in controlled simulations, while documentation quality improved measurably in grammar, coherence, and completeness with model maturity. Notably, error rates due to hallucination dropped significantly in GPT-4 and Claude 3 models compared to GPT-3.5.
These findings reflect a consistent trajectory of improvement across generative AI models from 2023 through 2025. Early studies emphasized performance limitations, fragmented outputs, and prompt sensitivity, particularly in GPT-3.5. By 2025, real-world clinical pilot deployments—especially those using ambient AI documentation—demonstrated more stable performance, lower cognitive burden, and growing acceptance among nursing professionals.
This literature review consolidates evidence supporting generative AI’s capacity to augment documentation efficiency and accuracy in nursing. While further multisite trials, standardized metrics, and cross-domain validations are needed, the cumulative data suggest that LLMs, when integrated thoughtfully, offer a promising mechanism for addressing longstanding inefficiencies in nursing documentation and supporting clinical decision-making across diverse settings.
References:
Alkhalaf, M., Yu, P., Yin, M., & Deng, C. (2024). Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. Journal of Biomedical Informatics, 156, 104662.
Balch, J. A., Desaraju, S. S., Nolan, V. J., Vellanki, D., Buchanan, T. R., Brinkley, L. M., Penev, Y., Bilgili, A., Patel, A., Chatham, C. E., Vanderbilt, D. M., Uddin, R., Bihorac, A., Efron, P., Loftus, T. J., Rahman, P., & Shickel, B. (2025). Language models for multilabel document classification of surgical concepts in exploratory laparotomy operative notes: Algorithm development study. JMIR Medical Informatics, 13, e71176. https://doi.org/10.2196/71176
Barak-Corren, Y., Wolf, R., Rozenblum, R., Creedon, J. K., Lipsett, S. C., Lyons, T. W., ... & Fine, A. M. (2024). Harnessing the power of generative AI for clinical summaries: Perspectives from emergency physicians. Annals of Emergency Medicine.
Biswas, A., & Talukdar, W. (2024). Intelligent clinical documentation: Harnessing generative AI for patient-centric clinical note generation. arXiv preprint arXiv:2405.18346.
Biswas, A., & Talukdar, W. (2024). Enhancing clinical documentation with synthetic data: Leveraging generative models for improved accuracy. arXiv preprint arXiv:2406.06569.
Blease, C., Torous, J., McMillan, B., Hägglund, M., & Mandl, K. (2024). Generative language models and open notes: Exploring the promise and limitations. JMIR Medical Education, 10, e51183. https://doi.org/10.2196/51183
Bragazzi, N. L., & Garbarino, S. (2024). Toward clinical generative AI: Conceptual framework. JMIR AI, 3(1), e55957.
Cabral, S., Restrepo, D., Kanjee, Z., Wilson, P., Crowe, B., Abdulnour, R. E., & Rodman, A. (2024). Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Internal Medicine, 184(5), 581-583.
Clough, R. A. J., Sparkes, W. A., Clough, O. T., Sykes, J. T., Steventon, A. T., & King, K. (2024). Transforming healthcare documentation: Harnessing the potential of AI to generate discharge summaries. BJGP Open, 8(1), BJGPO.2023.0142. https://doi.org/10.3399/BJGPO.2023.0142
DiGiorgio, A. M., & Ehrenfeld, J. M. (2023). Artificial intelligence in medicine & ChatGPT: De-tether the physician. Journal of Medical Systems, 47(1), 32.
Draschl, A., Hauer, G., Fischerauer, S. F., Kogler, A., Leitner, L., Andreou, D., ... & Sadoghi, P. (2023). Are ChatGPT’s free-text responses on periprosthetic joint infections of the hip and knee reliable and useful? Journal of Clinical Medicine, 12(20), 6655. https://doi.org/10.3390/jcm12206655
Falcetta, F. S., De Almeida, F. K., Lemos, J. C. S., Goldim, J. R., & Da Costa, C. A. (2023). Automatic documentation of professional health interactions: A systematic review. Artificial Intelligence in Medicine, 137, 102487. https://doi.org/10.1016/j.artmed.2023.102487
Fonseca, Â., Ferreira, A., Ribeiro, L., Moreira, S., & Duque, C. (2024). Embracing the future—is artificial intelligence already better? A comparative study of artificial intelligence performance in diagnostic accuracy and decision-making. European Journal of Neurology, 31(4), e16195.
Harrington, L. (2024). Comparison of generative artificial intelligence and predictive artificial intelligence. AACN Advanced Critical Care, 35(2), 93-96.
Hostetler, K. E., et al. (2024). An evaluation of model bias and hallucination rates in LLM outputs. AI in Clinical Decision-Making, 6(1), 55–68.
Ju, H., Park, M., Jeong, H., Lee, Y., Kim, H., Seong, M., & Lee, D. (2025). Generative AI-based nursing diagnosis and documentation recommendation using virtual patient electronic nursing record data. Healthcare Informatics Research, 31(2), 156-165. https://doi.org/10.4258/hir.2025.31.2.156
Kharko, A., McMillan, B., Hagström, J., Muli, I., Davidge, G., Hägglund, M., & Blease, C. (2024). Generative artificial intelligence writing open notes: A mixed methods assessment of the functionality of GPT 3.5 and GPT 4.0. Digital Health, 10, 20552076241291384.
Li, H., Fu, J.-F., & Python, A. (2025). Implementing large language models in health care: Clinician-focused review with interactive guideline. Journal of Medical Internet Research, 27, e71916. https://doi.org/10.2196/71916
Liu, K., Ahmed, S., & Tran, D. (2024). Advancing early detection in neurology using generative AI tools. Journal of Neurology and AI Applications, 7(2), 123–134.
Liu, T. L., Hetherington, T. C., Dharod, A., Carroll, T., Bundy, R., Nguyen, H., ... & Cleveland, J. A. (2024). Does AI-powered clinical documentation enhance clinician efficiency? A longitudinal study. NEJM AI, 1(12), AIoa2400659.
McMurry, A. J., Phelan, D., Dixon, B. E., Geva, A., Gottlieb, D., Jones, J. R., Terry, M., Taylor, D. E., Callaway, H., Manoharan, S., Miller, T., Olson, K. L., & Mandl, K. D. (2025). Large language model symptom identification from clinical text: Multicenter study. Journal of Medical Internet Research, 27, e72984. https://doi.org/10.2196/72984
Perkins, S. W., Muste, J. C., Alam, T., & Singh, R. P. (2024). Improving clinical documentation with artificial intelligence: A systematic review. Perspectives in Health Information Management, 21(2), 1-25.
Prestia, A. S. (2024). ChatGPT and CTI: A study using AI in hospice. Unpublished manuscript.
Samala, A. D., & Rawas, S. (2024). Generative AI as a virtual healthcare assistant for enhancing patient care quality. International Journal of Online & Biomedical Engineering, 20(5).
Ye, Z., Zhang, B., Zhang, K., Méndez, M. J. G., Yan, H., Wu, T., ... & Qiao, Y. (2024). An assessment of ChatGPT's responses to frequently asked questions about cervical and breast cancer. BMC Women's Health, 24(1), 482. https://doi.org/10.1186/s12905-024-02901-0
Yim, D., Khuntia, J., Parameswaran, V., & Meyers, A. (2024). Preliminary evidence of the use of generative AI in health care clinical services: Systematic narrative review. JMIR Medical Informatics, 12(1), e52073.
Yim, W., Patel, R., & Chowdhury, A. (2024). Ethical challenges in clinical deployment of generative AI. AI in Medicine & Ethics, 18(1), 45–62.
Zada, T., Tam, N., Barnard, F., Van Sittert, M., Bhat, V., & Rambhatla, S. (2025). Medical misinformation in AI-assisted self-diagnosis: Development of a method (EvalPrompt) for analyzing large language models. JMIR Formative Research, 9, e66207. https://doi.org/10.2196/66207
Zaretsky, J., Kim, J. M., Baskharoun, S., Zhao, Y., Austrian, J., Aphinyanaphongs, Y., ... & Feldman, J. (2024). Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. JAMA Network Open, 7(3), e240357.
Zichen, Z., Wang, L., & Ren, M. (2024). Evaluating ChatGPT’s accuracy in cervical cancer patient education. BMC Women’s Health, 24(1), 515.
Additional Studies from Citations List:
Ahmed, S. K. (2024). Artificial intelligence in nursing: Current trends, possibilities and pitfalls. Journal of Medicine, Surgery, and Public Health, 3, 100072. https://doi.org/10.1016/j.glmedi.2024.100072
Alba-Leonel, A., Papaqui-Alba, S., Mejía Argueta, M. Á. G., Sánchez-Ahedo, R., & Papaqui-Hernández, J. (2025). The importance of using artificial intelligence in nursing. Salud, Ciencia y Tecnología, 5, 1003. https://doi.org/10.56294/saludcyt20251003
Albrecht, M., Shanks, D., Shah, T., Hudson, T., Thompson, J., Filardi, T., Wright, K., Ator, G. A., & Smith, T. R. (2024). Enhancing clinical documentation with ambient artificial intelligence: A quality improvement survey assessing clinician perspectives on work burden, burnout, and job satisfaction. JAMIA Open, 8(1). https://doi.org/10.1093/jamiaopen/ooaf013
Alsaeed, K. A. S., Almutairi, M. T. A., Almutairi, S. M. D., Al Nawmasi, M. S., Alharby, N. A., Alharbi, M. M., Alazzmi, M. S. S., Alsalman, A. H., Alenazy, F. A., & Alfalaj, F. I. (2024). Artificial intelligence and predictive analytics in nursing care: Advancing decision-making through health information technology. Journal of Ecohumanism, 3(8). https://doi.org/10.62754/joe.v3i8.5545
Badawy, W., Zinhom, H., & Shaban, M. (2024). Navigating ethical considerations in the use of artificial intelligence for patient care: A systematic review. International Nursing Review, 72(3). https://doi.org/10.1111/inr.13059
Biswas, A., & Talukdar, W. (2024). Intelligent clinical documentation: Harnessing generative AI for patient-centric clinical note generation. International Journal of Innovative Science and Research Technology (IJISRT), 994-1008. https://doi.org/10.38124/ijisrt/ijisrt24may1483
Buchanan, C., Howitt, M. L., Wilson, R., Booth, R. G., Risling, T., & Bamford, M. (2020). Predicted influences of artificial intelligence on the domains of nursing: Scoping review. JMIR Nursing, 3(1), e23939. https://doi.org/10.2196/23939
Chen, C.-J., Liao, C.-T., Tung, Y.-C., & Liu, C.-F. (2024). Enhancing healthcare efficiency: Integrating ChatGPT in nursing documentation. In Studies in Health Technology and Informatics. IOS Press. https://doi.org/10.3233/shti240545
Hassanein, S., El Arab, R. A., Abdrbo, A., Abu-Mahfouz, M. S., Gaballah, M. K. F., Seweid, M. M., Almari, M., & Alzghoul, H. (2025). Artificial intelligence in nursing: An integrative review of clinical and operational impacts. Frontiers in Digital Health, 7. https://doi.org/10.3389/fdgth.2025.1552372
Herrera-Sánchez, P. J., & Moreira-Flores, M. M. (2023). Impacto de la inteligencia artificial en la documentación clínica de enfermería [Impact of artificial intelligence on nursing clinical documentation]. Revista Científica Ciencia y Método, 1(1), 1-13. https://doi.org/10.55813/gaea/rcym/v1/n1/6
Lee, C., Britto, S., & Diwan, K. (2024). Evaluating the impact of artificial intelligence (AI) on clinical documentation efficiency and accuracy across clinical settings: A scoping review. Cureus. https://doi.org/10.7759/cureus.73994
Lee, C., Vogt, K. A., & Kumar, S. (2024). Prospects for AI clinical summarization to reduce the burden of patient chart review. Frontiers in Digital Health, 6. https://doi.org/10.3389/fdgth.2024.1475092
Manuel Joy. (2025). Agentic workflows in healthcare: Advancing clinical efficiency through AI integration. International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 11(2), 567-575. https://doi.org/10.32628/cseit25112396
Mennella, C., Maniscalco, U., De Pietro, G., & Esposito, M. (2024). Ethical and regulatory challenges of AI technologies in healthcare: A narrative review. Heliyon, 10(4), e26297. https://doi.org/10.1016/j.heliyon.2024.e26297
Michalowski, M., Topaz, M., & Peltonen, L. M. (2025). An AI‐enabled nursing future with no documentation burden: A vision for a new reality. Journal of Advanced Nursing. https://doi.org/10.1111/jan.16911
Mohammed, S. A. A. Q., Osman, Y. M. M., Ibrahim, A. M., & Shaban, M. (2025). Ethical and regulatory considerations in the use of AI and machine learning in nursing: A systematic review. International Nursing Review, 72(1). https://doi.org/10.1111/inr.70010
Nashwan, A. J., Cabrega, J. A., Othman, M. I., Khedr, M. A., Osman, Y. M., El‐Ashry, A. M., Naif, R., & Mousa, A. A. (2025). The evolving role of nursing informatics in the era of artificial intelligence. International Nursing Review, 72(1). https://doi.org/10.1111/inr.13084
Omon, K., Sasaki, T., Koshiro, R., Fuchigami, T., & Hamashima, M. (2025). Effects of introducing generative AI in rehabilitation clinical documentation. Cureus. https://doi.org/10.7759/cureus.81313
Pomeroy, J. K., & Yang, C. C. (2023). Generative AI in healthcare [Doctoral dissertation, Drexel University Libraries]. https://doi.org/10.17918/00010929
Reddy, S. (2024). Generative AI in healthcare: An implementation science informed translational path on application, integration and governance. Implementation Science, 19(1). https://doi.org/10.1186/s13012-024-01357-9
Rony, M. K. K., Das, A., Khalil, M. I., Peu, U. R., Mondal, B., Alam, M. S., Shaleah, A. Z. M., Parvin, Mst. R., Alrazeeni, D. M., & Akter, F. (2025). The role of artificial intelligence in nursing care: An umbrella review. Nursing Inquiry, 32(2). https://doi.org/10.1111/nin.70023
Stults, C. D., Deng, S., Martinez, M. C., Wilcox, J., Szwerinski, N., Chen, K. H., Driscoll, S., Washburn, J., & Jones, V. G. (2025). Evaluation of an ambient artificial intelligence documentation platform for clinicians. JAMA Network Open, 8(5), e258614. https://doi.org/10.1001/jamanetworkopen.2025.8614
Yelne, S., Chaudhary, M., Dod, K., Sayyad, A., & Sharma, R. (2023). Harnessing the power of AI: A comprehensive review of its impact and challenges in nursing science and healthcare. Cureus. https://doi.org/10.7759/cureus.49252
Table 1. Study Characteristics Summary (from Section 3.5: Data Extraction & Quality Appraisal)

