Engineering Blog

ARTE Chatbot v0.1 - Initial Release

The initial release of ARTE, our Advanced Real Estate Tax Expert, offers a free and instant way to get highly accurate answers to your 1031 exchange questions.


Deferred.com’s AI chatbot, the Advanced Real Estate Tax Expert (ARTE), is a state-of-the-art chatbot driven by a large language model and trained to be an expert on 1031 exchanges and related real estate tax matters.

To date, ARTE has passed the following professional certification and continuing education courses:

ARTE has also passed a comprehensive set of internal benchmarks gauging accuracy across a number of topical areas related to 1031 exchanges and across varying question types. Below we dive into the details of our process, publish our benchmark scores, and compare our results against the most widespread, free, and publicly available consumer model.

Performance Benchmarks

Topical Expertise

To assess ARTE’s performance over time, we’ve developed a set of internal tests we can use for benchmarking. The benchmarks are categorized by topical areas judged to be most important when answering deep technical tax questions as they relate to real estate investors performing 1031 exchanges.

For each topical area, a score of 70% is considered passing for a licensed professional. We hold ARTE to a high standard, requiring a score that is greater than what would be expected of a typical professional working in the field.
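To make the scoring concrete, here is a minimal sketch of how a per-topic benchmark score can be computed and compared against that 70% bar. The helper names and the example data are illustrative, not our production evaluation harness.

```python
# Illustrative sketch only: per-topic benchmark scoring against the 70%
# passing threshold. Names and data are hypothetical, not our production harness.
PASSING_THRESHOLD = 0.70

def topic_score(results: list[bool]) -> float:
    """Fraction of benchmark questions answered correctly in one topical area."""
    return sum(results) / len(results)

def passes(results: list[bool]) -> bool:
    return topic_score(results) >= PASSING_THRESHOLD

# Example: 18 of 22 questions correct -> 81.82%, which clears the bar.
example = [True] * 18 + [False] * 4
print(f"{topic_score(example):.2%} pass={passes(example)}")
```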

Tax code, Regulations, and Other Clarifications (Benchmark Score: 87.67%)
Key topics:
  • Basic terminology and structure related to IRC Section 1031
  • Federal Tax Code and Regulations
  • IRS Safe Harbor Rules, Clarifications and Other Rulings
  • Calculating cost basis, boot and gain
  • State and federal tax terms
  • Case law
  • Tax Filing Requirements

1031 Exchange Process (Benchmark Score: 91.67%)
Key topics:
  • Requirements for an exchange
  • Timing requirements
  • Identification requirements
  • Flow of funds and Constructive Receipt

Ownership considerations (Benchmark Score: 100.00%)
Key topics:
  • Individual, joint, and spousal ownership
  • Partnership, corporation and limited liability company concerns
  • Disregarded entities
  • Fractional ownership interests (TICs / DSTs)
  • Related party transactions
  • Common vesting issues
  • Dealer and developer status

Qualifying properties (Benchmark Score: 81.82%)
Key topics:
  • Qualifying and non-qualifying property
  • Like-kind property types
  • Property usage and conversion of use
  • Mixed-use property
  • Primary residence considerations

Types of Sales & Exchanges (Benchmark Score: 90.63%)
Key topics:
  • Forward Exchanges
  • Reverse Exchanges
  • Simultaneous Exchanges
  • Construction or Improvement exchange issues
  • Multiple Property Exchanges
  • Combination exchanges
  • Installment sales (Section 453)
  • Involuntary conversions (Section 1033)
  • Mortgage and Financing Considerations

History and Evolution of 1031 exchanges (Benchmark Score: 64.00%)

Performance by question type

ARTE’s performance is currently benchmarked against two types of questions - Multiple Choice and Open Response.

ARTE Benchmark Score by question type:
  • Multiple Choice: 84.52%
  • Open Response: 83.33%

Comparison with Public Models

Topical Expertise

To understand our performance for this initial release, we’ve run a comparison against OpenAI’s GPT-3.5 model, which was broadly available and free for consumers to use in ChatGPT at the time of release.

ARTE significantly outperforms the public model on our internal benchmarks. With a score of 70% required to pass, the public model fails our internal test and is much more likely to provide inaccurate or misleading tax advice when it comes to 1031 exchanges.

ARTE Benchmark Score vs. OpenAI GPT-3.5 by topical area:
  • Tax code, Regulations, and Other Clarifications: ARTE 87.67%, GPT-3.5 68.49%, Delta +128%
  • 1031 Exchange Process: ARTE 91.67%, GPT-3.5 58.33%, Delta +157%
  • Ownership considerations: ARTE 100.00%, GPT-3.5 75.00%, Delta +133%
  • Qualifying properties: ARTE 81.82%, GPT-3.5 54.55%, Delta +150%
  • Types of Sales & Exchanges: ARTE 90.63%, GPT-3.5 43.75%, Delta +207%
  • History and Evolution of 1031 exchanges: ARTE 64.00%, GPT-3.5 56.00%, Delta +114%
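A note on reading the Delta column: it expresses ARTE’s score as a percentage of GPT-3.5’s score on the same topical area, rather than a difference in percentage points. A quick illustrative check:

```python
# Illustrative check of the Delta column: ARTE's score divided by GPT-3.5's
# score, expressed as a percentage (not a percentage-point difference).
def delta(arte_score: float, gpt35_score: float) -> str:
    return f"{arte_score / gpt35_score:.0%}"

print(delta(87.67, 68.49))  # -> 128%
print(delta(90.63, 43.75))  # -> 207%
```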

Performance by question type

When comparing ARTE’s performance to GPT-3.5 by question type, ARTE significantly outperforms in both categories.

Unsurprisingly, the public model has a much better chance of performing well on a multiple choice evaluation given the bounded set of potential answers (though in this case it still failed to achieve a passing grade).

However, on open response questions, the public model’s increased likelihood of hallucinating absent specific subject-matter expertise leads to a much lower benchmark score and highlights ARTE’s outperformance. When a wrong answer to a complex question can lead to thousands or even hundreds of thousands of dollars in tax liability, this difference is significant.

ARTE Benchmark Score vs. GPT-3.5 by question type:
  • Multiple Choice: ARTE 84.52%, GPT-3.5 59.52%, Delta +142%
  • Open Response: ARTE 83.33%, GPT-3.5 50.00%, Delta +167%

Methodology

Training data

ARTE is trained on public documents deemed relevant for 1031 exchanges. This includes the Internal Revenue Code, regulations, IRS rulings, case law, and other material used to provide guidance or set precedent when it comes to evaluating whether or not a 1031 exchange is qualifying and a taxpayer can successfully defer their capital gains.

Courses

When determining if ARTE can pass a professional certification course, the course material is run through an evaluator using the same model configuration as our chatbot. Course materials are not included in the training data, to prevent bias from the model recalling specific course content when answering questions. Once an evaluation is run, a human in the loop transcribes the evaluation output into the testing platform and submits the test to determine whether a passing score was achieved.
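A simplified sketch of that evaluation loop is below. `ask_arte` is a hypothetical stand-in for our chatbot call with the production model configuration, and the CSV output represents what the human reviewer transcribes into the testing platform.

```python
# Hedged sketch of the course-evaluation loop. `ask_arte` is a hypothetical
# stand-in for the chatbot pipeline run with the same model configuration
# as production; the CSV is handed to a human reviewer who transcribes the
# answers into the course's testing platform.
import csv

def ask_arte(question: str) -> str:
    raise NotImplementedError("stand-in for the production chatbot call")

def evaluate_course(questions: list[str], out_path: str) -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "model_answer"])
        for question in questions:
            writer.writerow([question, ask_arte(question)])
```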

Benchmark & Evaluation Details

Benchmarks

Topical areas within the benchmarks are not weighted in any way, which may skew the overall results based on the number of questions in each topical area.

There are a number of ways we’re interested in improving our benchmarks over time:

  • Increase the number of open response questions
  • Increase the number of questions in certain topical areas
  • Introduce a weighting of importance across topical areas to generate a more accurate overall score (a sketch of weighted scoring follows this list)
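To make the weighting idea concrete, the sketch below compares an unweighted average of the topical-area scores above against an importance-weighted average. The weights shown are purely illustrative, not values we have committed to.

```python
# Minimal sketch: unweighted vs. importance-weighted overall benchmark score.
# The scores are from the table above; the weights are purely illustrative.
scores = {
    "Tax code, Regulations, and Other Clarifications": 87.67,
    "1031 Exchange Process": 91.67,
    "Ownership considerations": 100.00,
    "Qualifying properties": 81.82,
    "Types of Sales & Exchanges": 90.63,
    "History and Evolution of 1031 exchanges": 64.00,
}
weights = {  # hypothetical importance weights, summing to 1.0
    "Tax code, Regulations, and Other Clarifications": 0.25,
    "1031 Exchange Process": 0.25,
    "Ownership considerations": 0.15,
    "Qualifying properties": 0.15,
    "Types of Sales & Exchanges": 0.15,
    "History and Evolution of 1031 exchanges": 0.05,
}
unweighted = sum(scores.values()) / len(scores)
weighted = sum(scores[topic] * weights[topic] for topic in scores)
print(f"unweighted: {unweighted:.2f}%  weighted: {weighted:.2f}%")
```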

We’re also interested in assessing working professionals against our test set and establishing a baseline for the benchmark that we can compare ourselves to.

Evaluations

Multiple Choice questions include 4 potential answers and are graded programmatically based on the letter answer output.
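A minimal sketch of that programmatic grading, assuming the model is prompted to state a single answer letter in its response:

```python
# Minimal sketch of programmatic multiple-choice grading, assuming the model
# is prompted to state a single answer letter (A-D) in its response.
import re

def grade_multiple_choice(model_output: str, correct_letter: str) -> bool:
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return match is not None and match.group(1) == correct_letter.upper()

print(grade_multiple_choice("C. Boot includes any non-like-kind property received.", "C"))  # True
print(grade_multiple_choice("The answer is B.", "C"))  # False
```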

Open Response questions are compared with an expected answer provided by subject matter experts and are model graded using OpenAI’s model-graded factual evaluation template. Scores are assigned as follows:

  • Correct and complete (grade B): the submitted answer is a superset of the expert answer and is fully consistent with it. Score: 1
  • Correct and complete (grade C): the submitted answer contains all the same details as the expert answer. Score: 1
  • Correct and partially complete (grade A): the submitted answer is a subset of the expert answer and is fully consistent with it. Score: 0.5
  • Incorrect (grade D): there is a disagreement between the submitted answer and the expert answer. Score: 0
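For completeness, a small sketch of how those letter grades map to numeric scores in our aggregation. The letter grade itself comes from the model-graded evaluation; this helper only converts it:

```python
# Sketch of the grade-to-score mapping from the table above. The letter grade
# is produced by the model-graded factual evaluation; this helper only
# converts it into the numeric score used in our benchmark aggregation.
GRADE_SCORES = {
    "B": 1.0,   # superset of the expert answer, fully consistent
    "C": 1.0,   # contains all the same details as the expert answer
    "A": 0.5,   # subset of the expert answer, fully consistent
    "D": 0.0,   # disagreement with the expert answer
}

def score_open_response(grade: str) -> float:
    return GRADE_SCORES[grade.strip().upper()]

print(score_open_response("A"))  # 0.5
```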