Engineering Blog

ARTE Chatbot v2.0 - Passing the CPA Exam

Deferred's AI tax chatbot ARTE passes the CPA exam and outperforms human CPA candidates by more than 22% in benchmark tests.


Deferred is excited to announce that our AI Real Estate Tax Expert (ARTE) achieves passing scores on the CPA exam and outperforms human test takers by 22.6%. ARTE also significantly outperforms baseline models on questions related to 1031 exchanges, a core use case for Deferred. We believe this makes ARTE the most accurate tax-oriented AI product on the market and the only one built specifically for real estate investors.

Try ARTE here

Ask ARTE a question and see for yourself!

In the rest of this post, we cover:

  • First publicly available benchmark of GPT-4o on CPA exam content
  • Benchmarking three Core CPA exams and one Discipline exam
    • Auditing and Attestation (AUD)
    • Financial Accounting and Reporting (FAR)
    • Taxation and Regulation (REG)
    • Tax Compliance and Planning (TCP)
  • Benchmarking across a proprietary data set used to gauge 1031 Exchange tax expertise
    • In addition to multiple choice questions, Deferred has developed a novel data set focused on word problems and user data that reflects the complexity of real-life exchange scenarios. 

Why did we do this?

At Deferred, we specialize in helping real estate investors complete 1031 exchanges. This process can be opaque and complicated, and typically requires working with a specialized team of exchange accommodators, tax attorneys, and CPAs. It's estimated that 94% of qualifying sales don't take advantage of this incredible program, particularly among mom-and-pop investors, and our goal is to help make 1031 exchanges more accessible.

We built our free, instant, and “always on” AI research assistant ARTE to make learning about, planning for, and executing a 1031 exchange easier. 

Passing the CPA exam

To ensure that ARTE presents accurate tax research, we wanted to benchmark performance using the most common professional exam in this space. We assembled thousands of questions from the Certified Public Accountant (CPA) exam, focusing on multiple choice questions spanning four exam sections. Our goal was to establish 1) ARTE's absolute performance and 2) ARTE's performance relative to human test takers studying for the CPA exam (CPA candidates).

About the CPA exam

The CPA exam is often considered one of the toughest professional exams in the world. 

It also has a complex grading system, combining multiple choice questions and task-based simulations. The exam evolves each year, and passing thresholds change over time and are published quarterly. For 2024, a score above 45% is considered passing for many exam sections, which speaks to how grueling the exam is.

When it comes to our testing, we made some simplifying assumptions based on the data we had available. For more information see our section on caveats below. 

For the CPA exams, we compared ARTE against unmodified, publicly available models provided by OpenAI, specifically GPT-3.5 and GPT-4o. We also compared performance against a data set of responses from CPA candidates. For 1031 exchange specific content, no CPA candidate data was available, so we assessed ARTE only against OpenAI's models.

Results 

ARTE outperformed human test takers across all exams. ARTE also outperformed the baseline models in FAR, REG, and TCP, and was comparable to, though slightly behind, them in AUD.


ARTE vs. GPT-3.5 and GPT-4o

To our knowledge, this is the first time GPT-4o has been publicly benchmarked on CPA exam content. Interestingly, there is very little difference in performance between GPT-3.5 and GPT-4o on this data set. ARTE outperforms both models significantly in FAR, REG, and TCP, but slightly underperforms both OpenAI models in AUD.

OpenAI has previously benchmarked GPT-3.5 and GPT-4 on other professional exams, and our results seem consistent with baseline models performing in the upper quartile on many professional exams.

Notably, other attempts at benchmarking against the CPA exam made use of advanced prompting techniques like chain-of-thought prompting or few-shot learning. While ARTE makes use of similar techniques, our assessment of OpenAI's baseline models focused on establishing a rigorous benchmark without any prompting techniques that could skew results. You can read more about this in the Methodology section at the end of the post.

We believe REG is the most relevant exam for our use case and AUD is the least relevant. While it is clear how we could improve our results on AUD, doing so would involve tradeoffs with performance in other areas of the exam, and updating our knowledge index to improve AUD results at the cost of other exam areas is not a worthwhile tradeoff.

| CPA Exam | ARTE 2.0 | GPT-4o | GPT-3.5 | Human |
| --- | --- | --- | --- | --- |
| AUD-2024 | 83% | 86% | 86% | 65% |
| FAR-2024 | 75% | 62% | 61% | 63% |
| REG-2024 | 80% | 75% | 75% | 63% |
| TCP-2024 | 73% | 60% | 60% | 66% |

ARTE vs. CPA candidates

Notably, ARTE outperforms CPA candidates in all cases.

On a relative basis, ARTE outperformed CPA candidates by an average of 24.6% on the Core Exams and 22.6% on a blended basis. 

| CPA Exam | ARTE 2.0 | Human | ARTE / Human |
| --- | --- | --- | --- |
| AUD-2024 | 83% | 65% | 128% |
| FAR-2024 | 75% | 63% | 119% |
| REG-2024 | 80% | 63% | 127% |
| TCP-2024 | 73% | 66% | 111% |
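The relative column above is simply ARTE's score divided by the candidate average for each exam, using the scores from the table. A quick check:

```python
# Relative performance = ARTE score / human candidate score, per exam.
# Scores are taken directly from the table above.
scores = {"AUD": (83, 65), "FAR": (75, 63), "REG": (80, 63), "TCP": (73, 66)}
for exam, (arte, human) in scores.items():
    print(f"{exam}: {arte / human:.0%}")  # AUD: 128%, FAR: 119%, REG: 127%, TCP: 111%
```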

1031 Exchange Expertise

About the 1031 exchange test suite

Section 1031 of the tax code can be complex and opaque when working through a complicated scenario. We have developed a novel data set, rooted in real-life exchange scenarios, to assess performance and establish benchmarks for these types of cases. We use these "word problem" style questions in combination with multiple choice question sets to create a testing suite focused on 1031 exchange expertise.

As an example, here is a sample test question from our new data set.

Taxpayer inherited a 25% undivided interest in Parcel #1 from her mother, who had received it as a gift from Taxpayer's father. Taxpayer's father had originally acquired Parcel #1, Parcel #2, and Parcel #3 for income-producing and investment purposes. After Taxpayer's father's death, Parcel #1 was transferred to Taxpayer's mother, and Parcels #2 and #3 were transferred to a Trust for the benefit of Taxpayer's mother during her lifetime, with Taxpayer and her three siblings as equal remainder beneficiaries. The Trust and the siblings decided to sell all their land holdings, including Parcels #1, #2, and #3. However, Taxpayer preferred to retain ownership of real estate. To accommodate this, the parties agreed that Taxpayer would exchange her 25% interest in Parcel #1 for a 100% interest in Parcel #3, with both properties having equal fair market values. Following the exchange, the Trust and the siblings sold Parcels #1 and #2 to an unrelated third party. Determine whether this exchange and the subsequent sale of Parcel #1 by the Trust and siblings triggers any gain recognition for Taxpayer under Section 1031(f).

We believe a good answer to a complex question like this should not only be correct, but also make the reasoning behind the answer explicit and include references to relevant sources, so readers can understand the research and vet the answer.
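For illustration, a word-problem test case in this suite might be stored roughly as follows. The field names here are illustrative, not our exact schema; only the question text comes from the example above:

```python
# Illustrative record format for an open-response ("word problem") test case.
# Field names and the id are hypothetical; the category matches the table below.
word_problem_case = {
    "id": "1031-ownership-0042",  # made-up identifier
    "category": "Ownership considerations",
    "question": "Taxpayer inherited a 25% undivided interest in Parcel #1 ...",
    "reference_answer": (
        "Expert-written answer explaining whether the Section 1031(f) "
        "related-party rules trigger gain recognition for Taxpayer, and why."
    ),
    "citations": ["IRC §1031(f)"],  # sources a good answer should reference
}
```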

ARTE vs. GPT-3.5 and GPT-4o

ARTE consistently outperforms both baseline models across all categories of this data set. Notably, GPT-3.5 outperforms GPT-4o in some categories.

More importantly, when it comes to the word problems, ARTE significantly outperforms the baseline models. 

We believe these word-problem results are significant for a few reasons. First, in production use, there are an infinite number of potential answers to a user's query, which is much harder to navigate than selecting an answer from a fixed choice set. Second, these questions represent real-world scenarios that combine multiple types of issues, which is much harder than a multiple choice question targeting specific knowledge in a single subject area.

| 1031 Category | ARTE 2.0 | GPT-4o | GPT-3.5 |
| --- | --- | --- | --- |
| Tax code, Regulations, and Other Clarifications | 91% | 81% | 81% |
| Qualifying properties | 96% | 74% | 79% |
| Ownership considerations | 91% | 81% | 82% |
| 1031 Exchange Process | 92% | 81% | 79% |
| Types of Sales & Exchanges | 83% | 83% | 73% |

Further Research

Going forward, we're excited to extend ARTE to cover more real estate tax concepts. We also think there are a number of improvements we can make to our novel testing data sets that will unlock further performance gains relative to baseline models. We're also interested in exploring improvements to the user experience, including fine-tuning-driven improvements to shape ARTE's responses and some novel UX patterns to collect (and respond with) structured data relevant for real estate investors.

Methodology

To perform this assessment, we built an internal testing platform that:

  • Uses a standardized format for test cases
  • Uses a slightly modified system prompt that encourages the model to commit to a specific answer option on multiple choice questions
  • Uses a combination of code-based and fine-tuned LLM pipelines to extract the chosen answer option from multiple choice responses
  • Uses model-graded scoring rubrics for open-response questions. We use a modified version of OpenAI's model-graded Factual Evaluation Template that adds few-shot prompting and uses GPT-4o as the grading model. A sketch of this pipeline appears below.
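To make this concrete, here is a minimal sketch of how a code-based answer extractor and a GPT-4o-graded rubric might fit together. The prompt wording and function names are illustrative, not our production code:

```python
# Simplified sketch of the evaluation pipeline; prompt wording and function
# names are illustrative, not the production implementation.
import re

from openai import OpenAI

client = OpenAI()

def extract_choice(response_text: str) -> str | None:
    """Code-based first pass: pull a single answer letter out of a model reply.
    Replies this regex cannot parse would fall through to a fine-tuned LLM extractor."""
    match = re.search(r"\b([A-D])\b", response_text)
    return match.group(1) if match else None

GRADER_TEMPLATE = """You are grading a tax research answer for factual accuracy.
Compare the submitted answer to the expert answer and respond with exactly one label:
CORRECT, PARTIALLY_CORRECT, or INCORRECT.

{few_shot_examples}

Question: {question}
Expert answer: {expert}
Submitted answer: {submission}
Label:"""

def grade_open_response(question: str, expert: str, submission: str,
                        few_shot_examples: str) -> str:
    """Model-graded rubric for open-response questions, with GPT-4o as the grader."""
    result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": GRADER_TEMPLATE.format(
            few_shot_examples=few_shot_examples,
            question=question,
            expert=expert,
            submission=submission,
        )}],
    )
    return result.choices[0].message.content.strip()
```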

OpenAI Model Configuration

To establish benchmarks, we ran our test suite using our internal platform against the relevant OpenAI models. These models did not have any custom system prompt, adjustments to default parameters such as temperature, or additional context provided in the prompt.
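In practice, each baseline call looked roughly like the following. The model identifiers shown are the common public names and may differ from the exact snapshots we used:

```python
# Minimal sketch of a baseline call: the raw question is the only input, with
# no system prompt and all sampling parameters left at their API defaults.
from openai import OpenAI

client = OpenAI()

def baseline_answer(question: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,  # e.g. "gpt-3.5-turbo" for the GPT-3.5 baseline
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```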

CPA Exam Caveats

When it comes to our testing, we made some simplifying assumptions based on the data we had available. 

  • Focus on multiple choice questions
    • We excluded task-based simulation questions. These present an area for further research, but grading them accurately would require improvements to our grading systems.
  • Focus on unweighted performance
    • While the CPA exam weights questions by difficulty, we did not have access to these weights during grading, so we treated each question as equally weighted.
  • Comparing performance against CPA candidates
    • For our test question bank, we were able to assemble performance data for CPA candidates. This provides a direct comparison against human performance on the exam from a representative sample of test takers.
  • Focus on content with the highest relevance to our use case
    • In addition to the three Core exams (AUD, FAR, REG), we added one Discipline exam (TCP) for exploration. We believe TCP is the most relevant exam for our use case.
    • Other Discipline exams (BAR, ISC) represent opportunities for further research. 

CPA Candidate Answers

Human performance was gathered from CPA practice exam questions. Each question had a varying number of attempts depending on the topical area, with more common questions receiving more attempts because they rotate into practice exams more frequently. The following histogram shows the distribution of attempts per question.