Stanford Research Reveals AI Legal Tools Susceptible to Generating Inaccurate Data

Thu Jun 13, 2024 - 9:40am GMT+0000

As technology permeates every aspect of professional life, the legal sector is no exception. Large language models (LLMs) have become integral to tasks that demand heavy information processing, and several companies now offer specialized tools that pair LLMs with information retrieval systems to support legal research. These advances promise to reshape how legal professionals access and use information.

The Promise of AI in Legal Research
Legal research tools powered by artificial intelligence are promoted as transformative for the legal field, promising quicker access to case law, statutes, and legal precedent. Companies such as LexisNexis and Thomson Reuters have developed AI-assisted research tools designed to streamline the process of legal inquiry and response.

However, a new study by Stanford University researchers takes a critical view of these advancements. The study, described by the authors as the first “preregistered empirical evaluation of AI-driven legal research tools,” examines how these tools handle more than 200 manually constructed legal queries. The findings reveal that although the tools exhibit reduced rates of hallucination (instances where the AI generates false information), hallucinations still occur at significant rates, posing a substantial challenge to the tools’ reliability in practical legal settings.

Understanding Hallucinations in AI
An AI system hallucinates when it confidently presents fabricated or unsupported information as fact. For legal AI tools, this could mean inventing case details or misstating legal statutes, errors that can lead to faulty legal advice or conclusions. The Stanford study indicates that even with advancements like retrieval-augmented generation (RAG), in which the AI fetches relevant documents to ground its responses, the problem persists.

The Limitations of Retrieval-Augmented Generation
RAG is widely regarded as the gold standard for enterprises aiming to curb AI hallucinations. Unlike basic LLMs, which depend solely on their training data, RAG systems retrieve pertinent documents and use them as a contextual basis for the model’s outputs. However, legal queries often involve complex, multifaceted issues that lack straightforward answers or documented precedents, making accurate retrieval challenging.
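
To make the mechanism concrete, the following is a minimal Python sketch of the RAG pattern: retrieve the most relevant passages, prepend them to the prompt, and generate an answer from that grounded context. Everything here is illustrative; the document store and case names are invented, the bag-of-words retriever stands in for the embedding-based search commercial tools use, and generate() is a stub for a real LLM call.

    from collections import Counter
    import math

    # Toy document store standing in for a commercial legal database.
    # All cases and summaries below are invented for illustration.
    DOCUMENTS = [
        "Adams v. Barker (1998): a signed memorandum satisfies the "
        "statute of frauds for land sales.",
        "Carter v. Diaz (2005): email exchanges may form a binding "
        "contract where no signature is required.",
        "Evans v. Ford (2012): oral modifications to a written "
        "contract require separate consideration.",
    ]

    def score(query: str, doc: str) -> float:
        # Cosine similarity over word counts; a crude stand-in for the
        # embedding-based retrievers real systems use.
        q, d = Counter(query.lower().split()), Counter(doc.lower().split())
        dot = sum(q[t] * d[t] for t in q)
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        return dot / norm if norm else 0.0

    def retrieve(query: str, k: int = 2) -> list[str]:
        # Fetch the k passages most similar to the query.
        return sorted(DOCUMENTS, key=lambda doc: score(query, doc), reverse=True)[:k]

    def build_prompt(query: str, passages: list[str]) -> str:
        # Prepend retrieved passages so the model answers from sources
        # rather than from memory alone.
        context = "\n".join(f"- {p}" for p in passages)
        return ("Answer using ONLY the sources below; if they are "
                "insufficient, say so rather than guessing.\n"
                f"Sources:\n{context}\n\nQuestion: {query}")

    def generate(prompt: str) -> str:
        # Stub for the LLM call. Even with relevant passages in the
        # prompt, a model can misread or embellish them, which is the
        # residual hallucination the Stanford study measures.
        return f"[model response conditioned on {len(prompt)}-character prompt]"

    query = "Can an email exchange satisfy the statute of frauds?"
    print(generate(build_prompt(query, retrieve(query))))

The sketch also shows where the failure mode lives: if retrieval surfaces the wrong passages for a multifaceted question, or the model embellishes the passages it is given, grounding alone does not prevent a hallucinated answer.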

Key Findings from the Stanford Study:

The tested AI tools showed hallucination rates of 17% to 33%, depending on the query.
Legal AI tools performed better than general-purpose AI systems but still produced inaccuracies at high rates.
The complexity of legal queries often exceeds the retrieval capabilities of current AI technologies, leading to errors.

Implications for Legal Practice
Despite these challenges, AI-driven tools can still enhance legal research. They serve as valuable starting points that can reduce the initial time spent searching through vast amounts of legal documents. However, the study highlights the need for cautious reliance on these tools, emphasizing the importance of attorney oversight in interpreting and verifying AI-generated information.

Calls for Transparency and Benchmarking
One of the study’s central recommendations is greater transparency in the legal AI sector. Because these AI systems are closed, legal professionals find it difficult to ascertain when and how far to trust their outputs. The researchers argue for public benchmarking of AI tools to foster the trust and reliability critical to adoption in sensitive fields like law.
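
In practice, a public benchmark of this kind can be simple: a fixed query set, human-labeled responses, and a reported hallucination rate. The Python sketch below is hypothetical, not the study’s actual protocol; the Result record, labels, and queries are invented for illustration.

    from dataclasses import dataclass

    @dataclass
    class Result:
        query: str
        response: str
        label: str  # reviewer's verdict: "correct", "incorrect", or "hallucinated"

    def hallucination_rate(results: list[Result]) -> float:
        # Share of responses a human reviewer flagged as hallucinated.
        if not results:
            return 0.0
        return sum(r.label == "hallucinated" for r in results) / len(results)

    # Hypothetical labeled outcomes for three benchmark queries.
    results = [
        Result("Does case X cite case Y?", "...", "correct"),
        Result("What is the holding of case Z?", "...", "hallucinated"),
        Result("Which statute governs topic W?", "...", "incorrect"),
    ]

    print(f"Hallucination rate: {hallucination_rate(results):.0%}")  # prints 33%

Publishing the query set and labeling criteria alongside the scores is what would let practitioners compare tools on equal footing.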

Industry Response
In response to the study, Mike Dahn, head of Westlaw Product Management at Thomson Reuters, expressed support for benchmarking AI solutions but said he was surprised by the high rate of inaccuracies reported. Thomson Reuters has conducted extensive internal testing and suggests the discrepancy may stem from the nature of the queries used in the Stanford study, which it says do not commonly arise in regular use of its AI tools.

Pablo Arredondo, VP of CoCounsel at Thomson Reuters, also commented on the study, applauding the initiative and expressing eagerness to explore the findings further. Discussions are underway to develop a consortium involving universities, law firms, and legal tech firms to establish and maintain benchmarks for legal AI applications.

Conclusion
While AI in legal research offers significant potential benefits, the Stanford study underscores the need for ongoing scrutiny and improvement. As AI tools become more embedded in legal practice, ensuring their reliability and accuracy remains paramount. The industry’s path forward will likely depend on balancing technological innovation with the rigorous standards of accuracy the law demands.