Beyond the multiple-choice diagnostic questions, it is valuable to evaluate the ability of PathChat and other MLLMs to produce consistent, reasonable and clinically relevant responses to open-ended pathology-related inquiries (see 'Benchmark of expert-curated pathology questions' in Methods). Based on the cases in PathQABench-Public, board-certified anatomic pathologists carefully curated open-ended questions targeting a wide range of topics, including microscopic image description, histological grade and differentiation status, risk factors, prognosis, treatment, diagnosis, IHC testing, molecular alterations and other testing. As with the multiple-choice evaluation, to mimic a real-world use case of a pathology AI assistant, each question was presented to the models as is, without any further model-specific or task-specific fine-tuning.
Given the more subjective nature of evaluating responses to open-ended questions, our evaluation consisted of two components. First, a panel of seven expert pathologists each independently ranked the responses to all questions from best to worst, with ties allowed (Fig. 3a), based on their relevance to the question, their correctness, and whether they were supplemented with correct explanation or reasoning in a concise manner (see 'MLLM evaluation' in Methods and Extended Data Figs. 3-5). Throughout the ranking process, the pathologists, who had not previously interacted with any of the models, were also blinded to which model produced which response. Additionally, the responses to each question were randomly shuffled to minimize potential bias towards any specific model. This component of the evaluation was designed to capture broad expert judgement of the responses, including subjective human preference.
Overall, we found that PathChat on average generated more preferable and more highly ranked responses than all of the other MLLMs tested. When considering head-to-head records of model rankings as judged by the human experts (for example, PathChat versus GPT-4V), a 'win' for PathChat on a question means that PathChat's response was ranked strictly higher than that of its counterpart. Likewise, a 'tie' for PathChat means that both models received the same rank, whereas a 'loss' means that PathChat was ranked strictly lower. Against the runner-up, GPT-4V, PathChat achieved a median win rate of 56.5% across the seven independent pathologist evaluators, with a median loss rate of only 22.3% and a median tie rate of 21.2% (Fig. 3b and Supplementary Tables 12 and 13). We again observed a larger performance gap against LLaVA 1.5 (median win rate of 67.7%, median loss rate of 11.2% and median tie rate of 21.5%) and LLaVA-Med (median win rate of 74.2%, median loss rate of 10.0% and median tie rate of 15.4%).
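For concreteness, the head-to-head statistics described above can be reproduced mechanically from the rank data. Below is a minimal sketch, assuming each evaluator's rankings are stored per question as integer ranks (lower is better, tied responses share a rank); the function names and data layout are illustrative assumptions and are not taken from the paper or any released code.

```python
from statistics import median

def head_to_head_rates(rankings, model_a, model_b):
    """Win/tie/loss rates of model_a versus model_b for one evaluator.

    `rankings` maps each question ID to a dict of {model_name: rank},
    where a lower rank means a more preferred response.
    """
    wins = ties = losses = 0
    for ranks in rankings.values():
        if ranks[model_a] < ranks[model_b]:
            wins += 1
        elif ranks[model_a] == ranks[model_b]:
            ties += 1
        else:
            losses += 1
    n = len(rankings)
    return wins / n, ties / n, losses / n

def median_rates(per_evaluator_rankings, model_a, model_b):
    """Median win/tie/loss rates across the panel of evaluators."""
    rates = [head_to_head_rates(r, model_a, model_b)
             for r in per_evaluator_rankings]
    win, tie, loss = zip(*rates)
    return median(win), median(tie), median(loss)

# Toy example with two evaluators and three questions:
evaluator_1 = {"q1": {"PathChat": 1, "GPT-4V": 2},
               "q2": {"PathChat": 2, "GPT-4V": 2},
               "q3": {"PathChat": 3, "GPT-4V": 1}}
evaluator_2 = {"q1": {"PathChat": 1, "GPT-4V": 3},
               "q2": {"PathChat": 1, "GPT-4V": 2},
               "q3": {"PathChat": 2, "GPT-4V": 2}}
print(median_rates([evaluator_1, evaluator_2], "PathChat", "GPT-4V"))
```

Applying the same computation to the seven pathologists' rankings would yield the median win/tie/loss rates reported above (for example, 56.5%/21.2%/22.3% for PathChat versus GPT-4V).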
Additionally, to establish a more objective metric of each model's accuracy on the open-ended questions, two board-certified pathologists independently reviewed the responses to every question. They assigned a binary label of correct versus incorrect to each model's response (while blinded to the identity of each model). To mitigate the degree of subjectivity, the two pathologists then discussed all questions on which their assessments disagreed in order to reach a consensus. For 235 of the 260 questions, full consensus was reached for all models, and we used the consensus as the ground truth to compute the accuracy of each model. Specifically, PathChat scored an accuracy of 78.7% on the subset of open-ended questions for which the pathologists were able to reach a consensus (Fig. 3c and Supplementary Table 14), which corresponds to an improvement of +26.4% (P < 0.001) compared to the accuracy of 52.3% achieved by the runner-up, GPT-4V. Compared to the publicly available general-purpose MLLM LLaVA 1.5 (accuracy of 29.8%) and the biomedicine-specialized MLLM LLaVA-Med (accuracy of 30.6%), the margin of improvement was even more substantial, at +48.9% and +48.1%, respectively (P < 0.001 for both). We show the accuracy of each model as assessed by each pathologist on the full set of questions (including the remaining questions for which disagreement remained) in Extended Data Fig. 6.
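The consensus-based accuracy and the significance of the accuracy gaps can likewise be computed from per-question correctness labels. The sketch below illustrates one plausible approach; this excerpt does not state which statistical test was used, so the paired permutation test shown here, like the function names and toy data, is an assumption for illustration only.

```python
import numpy as np

def consensus_accuracy(correct):
    """Accuracy on the consensus subset.

    `correct` holds one boolean/0-1 entry per consensus question
    (1 if the consensus label for the model's response is 'correct').
    """
    return float(np.mean(correct))

def paired_permutation_test(correct_a, correct_b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test for a difference in accuracy.

    For each permutation, the correct/incorrect labels of the two models
    are randomly swapped per question; the p-value is the fraction of
    permutations whose accuracy gap is at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    count = 0
    for _ in range(n_perm):
        swap = rng.random(a.shape[0]) < 0.5        # swap labels per question
        a_perm = np.where(swap, b, a)
        b_perm = np.where(swap, a, b)
        if abs(a_perm.mean() - b_perm.mean()) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)              # add-one smoothing

# Toy example on a handful of hypothetical consensus questions:
pathchat = [1, 1, 0, 1, 1, 1, 0, 1]
gpt4v    = [1, 0, 0, 1, 0, 1, 0, 0]
print(consensus_accuracy(pathchat), consensus_accuracy(gpt4v))
print(paired_permutation_test(pathchat, gpt4v))
```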
These results demonstrate that, overall, PathChat generated responses that were both more accurate and more preferable across diverse pathology-related queries. Additionally, to better understand the relative strengths and weaknesses of the different models, we analysed their performance for various subgroups of questions (described in Supplementary Tables 15 and 16 with examples provided in Extended Data Fig. 7). In particular, the microscopy category includes questions that test the ability of models to generate accurate and detailed morphological descriptions of histology microscopy images and assess clinically relevant features such as tumour differentiation and grade. Questions in the diagnosis category tested the ability of the models to directly suggest a reasonable diagnosis based on the histology image available and relevant clinical context (unlike the multiple-choice questions for which possible choices are provided). The clinical questions tested the ability to retrieve clinically relevant background knowledge about the disease in question, including risk factors, prognosis and treatment. Ancillary testing questions tested the ability of the models to suggest further testing, such as IHC and molecular workups, to confirm a specific diagnosis or inform prognosis and treatment.
Although GPT-4V was the runner-up to PathChat overall, PathChat’s responses were especially superior to those of GPT-4V in the categories that require examination of the histology image (microscopy and diagnosis), for which the accuracies on the consensus subset were 73.3% and 78.5% for PathChat, respectively, versus 22.8% and 31.6% for GPT-4V (Fig. 3d and Supplementary Tables 17–19). Similarly, the median head-to-head win rate against GPT-4V reached 70.6% and 71.3% on these two categories of questions, respectively, compared to the average median win rate of 57.4%. Coupled with a median lose rate against GPT-4V of only 13.8% on both these categories, the results imply that PathChat was better than or as good as GPT-4V in around 86% of queries that emphasize histology image examination (Extended Data Figs. 8 and 9 and Supplementary Tables 20–27). On the other hand, we found that PathChat lagged somewhat behind GPT-4V on clinical and ancillary testing questions: on the consensus subset, PathChat achieved a respectable 80.3% accuracy on both categories, compared to GPT-4V’s higher scores of 88.5% and 89.5% on the two categories, respectively. Note that although PathChat convincingly outperformed GPT-4V in accuracy on the microscopy and diagnosis categories according to the consensus (P < 0.001 for both, n = 101 and 79, respectively), we did not find statistical significance (P > 0.05) for GPT-4V’s higher accuracy on the clinical and ancillary testing categories (according to the consensus, P = 0.291, n = 61 for clinical and P = 0.153, n = 76 for ancillary testing), which suggests that the difference in performance between PathChat and the runner-up GPT-4V on these categories may not be meaningful. Similarly, according to the more subjective ranking-based evaluation, we found that PathChat was comparable to, and in fact slightly more preferred than, GPT-4V by the panel of pathologists on these same categories (a median win rate of 44.1% and lose rate of 33.8% versus GPT-4V for clinical, and a median win rate of 44.8% and lose rate of 35.6% for ancillary testing).
Note that we included the clinical and ancillary testing questions to comprehensively assess the capabilities of AI assistant models in addressing pathology-related queries. However, these questions often do not require actual examination of the histology image and instead primarily serve to test a model's ability to recall background knowledge relevant to pathology (for example, 'What specific molecular alterations are commonly found in disease X, and how do they affect prognosis or treatment options?'). As a result, even general-purpose multimodal AI assistants such as LLaVA 1.5 can often adequately answer questions in these categories, and GPT-4V in particular may excel here, presumably because it is much larger and was trained on a broader range of knowledge from the internet than the open-source models. Because such queries can often be readily addressed through conventional means of inquiry (for example, an internet search or consulting a reference manual), we focus on the microscopy and diagnosis categories as the primary indicators of the utility of the different models as visual assistants for pathology, given that for the other two categories, AI assistance does not necessarily need to be grounded in the pathology image. A further breakdown of model performance by subcategory is included in Supplementary Tables 28–38. Note that although our benchmark for answering open-ended questions is pathology-specific, it is roughly twice the size of the set of 140 questions used in earlier work in which human experts assessed the ability of LLMs to encode general clinical knowledge.
Finally, note that, similar to our observation in the multiple-choice evaluation, GPT-4V apparently declined to answer 38 of the 260 questions submitted, presumably because of guardrails implemented in it. Up to three attempts were made per question (see 'Evaluation of GPT-4V' in Methods for more details). Consistent with our evaluation of the other models, all GPT-4V responses, whether successful or not, were blinded, shuffled and presented to the pathologists for evaluation without special treatment. However, for transparency, we recorded the number of ultimately unsuccessful GPT-4V queries in each question category (Supplementary Table 39) and also report results for the subset of questions that GPT-4V successfully answered (Supplementary Tables 40–64), on which PathChat still outperformed GPT-4V by more than 20% in accuracy (59% for GPT-4V on the consensus questions, P < 0.001).