Some Challenges Facing Physician AI Scribes

Recent reporting from the Associated Press highlights the challenges of adopting emerging generative AI technologies in the workplace. The article focuses on how American health care providers are using OpenAI’s transcription tool, Whisper, to transcribe patients’ conversations with medical staff.

These activities are occurring despite OpenAI’s warnings that Whisper should not be used in high-risk domains.
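Part of the appeal is how little effort transcription takes. The sketch below is illustrative only: it assumes the open-source openai-whisper Python package and a hypothetical local audio file, not any particular vendor’s scribe product, but it shows roughly what sits underneath these tools.

```python
# Illustrative sketch only: assumes the open-source `openai-whisper` package
# (pip install openai-whisper) and a local recording. Commercial medical
# scribes wrap models like this inside larger pipelines.
import whisper

# Load a pretrained Whisper checkpoint ("base" is small and fast; larger
# checkpoints such as "medium" or "large" are more accurate).
model = whisper.load_model("base")

# Transcribe a recorded conversation. The result includes the full text
# plus per-segment timing and confidence metadata.
result = model.transcribe("patient_visit.wav")

print(result["text"])
```

The ease of that call is exactly why the warnings matter: nothing in the output signals, on its own, whether the text faithfully reflects what was said.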

The article reports that a “machine learning engineer said he initially discovered hallucinations in about half of the over 100 hours of Whisper transcriptions he analyzed. A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper. The problems persist even in well-recorded, short audio samples. A recent study by computer scientists uncovered 187 hallucinations in more than 13,000 clear audio snippets they examined.”

Transcription errors can be very serious. Research by Prof. Allison Koenecke of Cornell University and Prof. Mona Sloane of the University of Virginia found:

… that nearly 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.

In an example they uncovered, a speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.”

But the transcription software added: “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”

A speaker in another recording described “two other girls and one lady.” Whisper invented extra commentary on race, adding “two other girls and one lady, um, which were Black.”

In a third transcription, Whisper invented a non-existent medication called “hyperactivated antibiotics.”

In some cases, the original voice recordings are deleted for privacy reasons, which prevents physicians (or other medical personnel) from double-checking the accuracy of a transcription. Obvious errors may be caught quickly and easily, but subtler mistakes are far less likely to be noticed.
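One partial mitigation, where the audio is retained, is to surface the model’s own per-segment confidence signals so a clinician knows which passages to re-check. The sketch below is a hedged illustration using the same openai-whisper package; the segment fields (avg_logprob, no_speech_prob) are part of its output, but the thresholds are arbitrary examples, not validated clinical settings.

```python
# Illustrative only: flag Whisper segments that look unreliable so a human
# can re-check them against the audio. Thresholds are arbitrary examples,
# not clinically validated values.
import whisper

model = whisper.load_model("base")
result = model.transcribe("patient_visit.wav")

for seg in result["segments"]:
    # avg_logprob: mean token log-probability for the segment (lower means
    # less confident). no_speech_prob: the model's estimate that the segment
    # is not actually speech, a span where hallucinated text often appears.
    suspicious = seg["avg_logprob"] < -1.0 or seg["no_speech_prob"] > 0.5
    if suspicious:
        print(f"REVIEW {seg['start']:.1f}-{seg['end']:.1f}s: {seg['text']}")
```

A review queue like this only helps if someone has both the time and the source audio to check against, which is precisely what is lost when recordings are deleted.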

One area where work still needs to be done is assessing the relative accuracy of AI scribes against that of physicians. Automated transcription introduces errors, but what is the error rate of physicians taking their own notes? And what is the difference in quality of care between a physician who is note-taking during a visit and one who reviews a transcript after the interaction? These are central questions that should play a significant role in assessments of when and how these technologies are deployed.