Deferred.com’s Advanced Real-estate Tax Expert (ARTE) is a state-of-the-art AI chatbot, driven by a large language model and trained to be an expert on 1031 exchanges and related real estate tax matters.
To date, ARTE has passed the following professional certification and continuing education courses:
- Section 1031 Real Property Like-Kind Exchanges (CPA Continuing Professional Education)
- Concepts and Mechanics of Exchanges (IRS Enrolled Agent & CPA Continuing Professional Education)
- Real Estate Closings and 1031 Exchange (Continuing Legal Education for Attorneys)
- Advanced 1031 Exchange Concepts and Opportunity Zones (Continuing Legal Education for Attorneys and American Institute of Professional Bookkeepers)
- Like-Kind Exchanges and Delaware Statutory Trusts Under Section 1031 (Continuing Legal Education for Attorneys)
- Recording and Accounting for Asset Exchanges (General Professional Education)
- Recording and Accounting for Asset Exchanges (Continuing Education for National Association of State Boards of Accountancy)
ARTE has also passed a comprehensive set of internal benchmarks gauging accuracy across a number of topical areas related to 1031 exchanges and across varying question types. Below we dive into the details of our process, publish our scores across our benchmark, and compare our results against the most widespread, free, and publicly available consumer model.
Performance Benchmarks
Topical Expertise
To assess ARTE’s performance over time, we’ve developed a set of internal tests we can use for benchmarking. The benchmarks are categorized by topical areas judged to be most important when answering deep technical tax questions as they relate to real estate investors performing 1031 exchanges.
For each topical area, a score of 70% is considered passing for a licensed professional. We hold ARTE to a high standard, requiring a score that is greater than what would be expected of a typical professional working in the field.
Performance by question type
ARTE’s performance is currently benchmarked against two types of questions: Multiple Choice and Open Response.
Comparison with Public Models
Topical Expertise
To understand our performance for our initial release, we’ve run a comparison against OpenAI’s GPT-3.5 model, which was broadly available and free for consumers to use in ChatGPT at the time of release.
ARTE significantly outperforms the public model on our internal benchmarks. With a score of 70% required to pass, the public model fails our internal test and is much more likely to provide inaccurate or misleading tax advice when it comes to 1031 exchanges.
Performance by question type
When comparing ARTE’s performance to GPT-3.5 by question type, ARTE significantly outperforms on both.
Unsurprisingly, a public model has a much better chance of performing well in a multiple choice evaluation given the bounded set of potential answers (though in this case it still failed to achieve a passing grade).
With an open response question, however, the increased likelihood of hallucinating absent specific expertise in the subject matter leads to a much lower benchmark score and highlights ARTE’s outperformance. When dealing with complex questions where a wrong answer can lead to thousands or hundreds of thousands of dollars in tax liability, this difference is significant.
Methodology
Training data
ARTE is trained on public documents deemed relevant for 1031 exchanges. This includes the Internal Revenue Code, regulations, IRS rulings, case law, and other material used to provide guidance or set precedent when it comes to evaluating whether or not a 1031 exchange is qualifying and a taxpayer can successfully defer their capital gains.
Courses
When determining if ARTE can pass a professional certification course, the course material is run through an evaluator using the same model configuration as our chatbot. Course materials are not included in the training data, to prevent bias from recalling specific course content when answering questions. Once an evaluation is run, a human in the loop transfers the evaluation output into the testing platform and submits the test to determine whether a passing score was achieved.
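To make that workflow concrete, here is a minimal sketch of what running course questions through the chatbot’s model configuration could look like. This is illustrative only: the model name, temperature, system prompt, and example questions are placeholder assumptions rather than ARTE’s actual configuration, and the real process includes the human review and submission step described above.

```python
# Illustrative sketch only; model, temperature, prompt, and questions are placeholders.
from openai import OpenAI

client = OpenAI()

def answer_course_question(question: str) -> str:
    """Run a single course question through the same model configuration as the chatbot."""
    response = client.chat.completions.create(
        model="gpt-4",   # placeholder model name
        temperature=0,   # deterministic output simplifies grading and human review
        messages=[
            {"role": "system", "content": "You are an expert on 1031 exchanges and related real estate tax matters."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Hypothetical course questions; a human in the loop then transfers each answer
# into the testing platform and submits the test.
course_questions = [
    "How many calendar days does a taxpayer have to identify replacement property?",
    "What role does a qualified intermediary play in a deferred exchange?",
]
answers = [answer_course_question(q) for q in course_questions]
```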
Benchmark & Evaluation Details
Benchmarks
Topical areas within the benchmarks are not weighted in any way, which may skew results based on the number of questions in each topical area.
There are a number of ways we’re interested in improving our benchmarks over time:
- Increase the number of open response questions
- Increase the number of questions in certain topical areas
- Introduce a weighting of importance across topical areas to generate a more accurate overall score (see the sketch below).
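As one example of how weighting could work, the overall score could be computed as a weighted average of per-topic scores. The sketch below is purely illustrative; the topic names and weights are hypothetical, not our actual benchmark categories or weighting scheme.

```python
# Illustrative only: topic names and weights are hypothetical.
def weighted_overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-topic benchmark scores."""
    total_weight = sum(weights[topic] for topic in scores)
    return sum(scores[topic] * weights[topic] for topic in scores) / total_weight

scores = {"Like-Kind Property": 0.92, "Exchange Timelines": 0.85, "Boot & Basis": 0.78}
weights = {"Like-Kind Property": 2.0, "Exchange Timelines": 1.0, "Boot & Basis": 1.0}
print(round(weighted_overall_score(scores, weights), 4))  # 0.8675 -- the heavier topic counts double
```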
We’re also interested in assessing working professionals against our test set and establishing a baseline for the benchmark that we can compare ourselves to.
Evaluations
Multiple Choice questions include four potential answers and are graded programmatically based on the letter of the answer the model outputs.
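As an illustrative sketch, programmatic grading can be as simple as pulling the answer letter out of the model’s output and comparing it to the answer key. The parsing rule below is an assumption for demonstration, not necessarily how our harness normalizes outputs.

```python
import re

def grade_multiple_choice(model_output: str, correct_letter: str) -> bool:
    """Extract the first standalone answer letter (A-D) from the output and compare it to the key."""
    match = re.search(r"\b([ABCD])\b", model_output.upper())
    return bool(match) and match.group(1) == correct_letter.upper()

print(grade_multiple_choice("The correct answer is C.", "C"))  # True
print(grade_multiple_choice("B) The exchange fails the holding requirement.", "D"))  # False
```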
Open Response questions are compared with an expected answer provided by subject matter experts and are model-graded based on OpenAI’s model-graded Factual Evaluation template. Scores are assigned accordingly: