AI systems are increasingly being deployed in safety-critical health care situations. Yet these models sometimes hallucinate incorrect information, make biased predictions, or fail for unexpected reasons, which could have serious consequences for patients and clinicians.
In a commentary article published today in Nature Computational Science, MIT Associate Professor Marzyeh Ghassemi and Boston University Associate Professor Elaine Nsoesie argue that, to mitigate these potential harms, AI systems should be accompanied by responsible-use labels, similar to U.S. Food and Drug Administration-mandated labels placed on prescription medications.
MIT News spoke with Ghassemi about the need for such labels, the information they should convey, and how labeling procedures could be implemented.
Q: Why do we need responsible use labels for AI systems in health care settings?
A: In a health setting, we have an interesting situation where doctors often rely on technology or treatments that are not fully understood. Sometimes this lack of understanding is fundamental (the mechanism behind acetaminophen, for instance), but other times it is just a limit of specialization. We don’t expect clinicians to know how to service an MRI machine, for example. Instead, we have certification systems through the FDA or other federal agencies that certify the use of a medical device or drug in a specific setting.
Importantly, medical devices also have service contracts: a technician from the manufacturer will fix your MRI machine if it is miscalibrated. For approved drugs, there are postmarket surveillance and reporting systems so that adverse effects or events can be addressed, for instance, if a lot of people taking a drug seem to be developing a condition or allergy.
Models and algorithms, whether they incorporate AI or not, skirt a lot of these approval and long-term monitoring processes, and that is something we need to be wary of. Many prior studies have shown that predictive models need more careful evaluation and monitoring. With more recent generative AI specifically, we cite work that has demonstrated generation is not guaranteed to be appropriate, robust, or unbiased. Because we don’t have the same level of surveillance on model predictions or generation, it would be even more difficult to catch a model’s problematic responses. The generative models being used by hospitals right now could be biased. Having use labels is one way of ensuring that models don’t automate biases that are learned from human practitioners or miscalibrated clinical decision support scores of the past.
Q: Your article describes several components of a responsible use label for AI, following the FDA approach for creating prescription labels, including approved usage, ingredients, potential side effects, etc. What core information should these labels convey?
A: The things a label should make obvious are the time, place, and manner of a model’s intended use. For time, the user should know that a model was trained at a specific time with data from a specific time period. For instance, does that data cover the Covid-19 pandemic or not? There were very different health practices during Covid that could impact the data. This is why we advocate for the model “ingredients” and “completed studies” to be disclosed.
For place, we know from prior research that models trained in one location tend to have worse performance when moved to another location. Knowing where the data were from and how a model was optimized within that population can help to ensure that users are aware of “potential side effects,” any “warnings and precautions,” and “adverse reactions.”
With a model trained to predict one outcome, knowing the time and place of training could help you make intelligent judgements about deployment. But many generative models are incredibly flexible and can be used for many tasks. Here, time and place may not be as informative, and more explicit direction about “conditions of labeling” and “approved usage” versus “unapproved usage” comes into play. If a developer has evaluated a generative model for reading a patient’s clinical notes and generating prospective billing codes, they can disclose that it has a bias toward overbilling for specific conditions or underrecognizing others. A user wouldn’t want to use this same generative model to decide who gets a referral to a specialist, even though they could. This flexibility is why we advocate for additional details on the manner in which models should be used.
In general, we advocate that you should train the best model you can, using the tools available to you. But even then, there should be a lot of disclosure. No model is going to be perfect. As a society, we now understand that no pill is perfect; there is always some risk. We should have the same understanding of AI models. Any model, with or without AI, is limited. It may be giving you realistic, well-trained forecasts of potential futures, but take that with whatever grain of salt is appropriate.
Q: If AI labels were to be implemented, who would do the labeling and how would labels be regulated and enforced?
A: If you don’t intend for your model to be used in practice, then the disclosures you would make for a high-quality research publication are sufficient. But once you intend your model to be deployed in a human-facing setting, developers and deployers should do an initial labeling, based on some of the established frameworks. There should be a validation of these claims prior to deployment; in a safety-critical setting like health care, many agencies of the Department of Health and Human Services could be involved.
For model developers, I think that knowing you will need to label the limitations of a system induces more careful consideration of the process itself. If I know that at some point I am going to have to disclose the population upon which a model was trained, I would not want to disclose that it was trained only on dialogue from male chatbot users, for instance.
Thinking about things like who the data are collected on, over what time period, what the sample size was, and how you decided what data to include or exclude can open your mind up to potential problems at deployment.