AI Architecture Benchmarks

August 2024

This month we have prepared something special for you. Instead of benchmarking just LLMs, we present our first benchmark of various AI architectures.

This was done as the first round of our Enterprise RAG Challenge. Within that challenge we worked together with individual consultants and some vendors of commercial AI solutions.

Industry Overview

First of all, we have mapped all proven cases of successful AI application (that are known to us) on a single map by industry and impact area.

Afterwards, we reviewed the entire portfolio and identified recurring themes that persist across industry and application boundaries. There were a few:

  • Many successful business applications of AI boil down to using ChatGPT with a couple of simple LLM patterns: checklists, routers and knowledge maps. It can be surprising how much value is achieved with just a few prompts and lines of code.

  • Most successful cases don’t act as standalone systems, but rather integrate into existing processes as copilots and assistants. Sometimes they are even invisible to the end users.

  • If we look at the numbers alone, the most popular AI use case is building “AI Search” or “AI Assistants” for the business.
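As a concrete illustration of the router pattern mentioned above, here is a minimal sketch in Python. The keyword-based `classify` function is a hypothetical stand-in for what would normally be a single LLM classification prompt:

```python
# Minimal sketch of the "router" LLM pattern: classify an incoming
# request into a route, then dispatch it to a specialised handler.
# In a real system, classify() would be one LLM prompt; here a
# keyword stub stands in so the sketch stays self-contained.

def classify(request: str) -> str:
    """Stand-in for an LLM call that picks exactly one route label."""
    text = request.lower()
    if "invoice" in text or "payment" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "support"
    return "general"

HANDLERS = {
    "billing": lambda r: f"[billing] {r}",
    "support": lambda r: f"[support] {r}",
    "general": lambda r: f"[general] {r}",
}

def route(request: str) -> str:
    """Dispatch the request to the handler for its route."""
    return HANDLERS[classify(request)](request)
```

The value of the pattern is that each handler gets a narrow, well-scoped prompt instead of one prompt trying to do everything.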

AI Search or AI Assistants are examples of use cases where a company wants a system that can provide intelligent answers based on files and documents. This is the most popular case and sometimes an entry point into AI for companies.

This is also one of the most controversial types of project. A popular opinion is to implement such a solution using vector databases and RAG. However, even if you stick to that opinion (which we don’t), there are still many different LLMs, frameworks and architectural nuances to pick from.

So how would one implement an AI Assistant over company documents?
 

Enterprise RAG Challenge

To answer that question in a collaborative manner, we have set up the Enterprise RAG Challenge: a friendly competition to test the accuracy of different RAG systems in business workloads. It goes like this:

Participants build a system that can answer questions about uploaded PDF documents (annual reports). Alternatively, they can test an existing AI Assistant system they have already built.

Anybody can participate, anonymously if desired. All we ask is that you share some details about your RAG implementation, for the benefit of the community. We would like to learn what works better in practice and share that with everybody.

When the competition starts:

  1. Participants receive a set of annual reports as PDFs in advance. They can take some time to process them.

  2. A list of questions about these files is generated. Participants then have only a few minutes (to rule out manual processing) to produce their answers and upload them.

  3. Afterwards the answers are checked in public and compiled into a public dataset.

You will be able to compare the performance of different teams and technologies (as long as a team decided to answer at least a few questions) within a single table. We will also compile and publish a report at the end of the challenge.

You can read more about the competition on Github. The description there is somewhat geeky, since we went to great lengths to make sure that the competition is fair for everybody.


Round 1

At the end of the summer we started the first trial run.

All information about the first round is publicly available on our Github page under the Apache license. The code for the question generator, file selector, random seed selector and ranking is available as well, along with the team submissions.
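The repository holds the actual implementation; as a rough illustration of why a shared random seed matters for fairness, a deterministic file selector could look like this (names and parameters are our own, not the challenge’s code):

```python
# Sketch: deterministic, seed-driven file selection. Anyone re-running
# the selector with the same published seed gets exactly the same set
# of files, so no participant can be handed a different selection.
import random

def select_files(all_files: list[str], count: int, seed: int) -> list[str]:
    """Pick `count` files reproducibly from `all_files` given a seed."""
    rng = random.Random(seed)  # local RNG; global random state untouched
    return sorted(rng.sample(all_files, count))

files = [f"report_{i}.pdf" for i in range(100)]
picked = select_files(files, 20, seed=42)
# Re-running with the same seed yields the identical selection.
assert picked == select_files(files, 20, seed=42)
```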

The teams received 20 Annual Reports in PDF form and were expected to automatically generate responses to questions like:

Which company had a higher total assets: "MITSUI O.S.K. LINES", "ENRG ELEMENTS LIMITED" or "First Mid Bancshares, Inc.", in the fiscal year 2021?

or:

What was the free cash flow of "Österreichische Kontrollbank" in the financial year 2023?

The last question is actually a trick question to test for hallucinations. The Österreichische Kontrollbank report covered only the year 2022, so models are expected to refuse to answer and return N/A in such cases.
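One way to make such refusals explicit is to bake “N/A” into the answer schema itself. A minimal sketch of that idea (our own simplified illustration, not the challenge’s actual schema):

```python
# Sketch of an answer record that makes "N/A" a first-class value,
# so a model can decline cleanly instead of hallucinating a number.
from dataclasses import dataclass

@dataclass
class Answer:
    question: str
    value: str  # a number rendered as a string, or the literal "N/A"

def validate(ans: Answer) -> bool:
    """Accept either a parseable number or an explicit refusal."""
    if ans.value == "N/A":
        return True
    try:
        float(ans.value.replace(",", ""))
        return True
    except ValueError:
        return False

assert validate(Answer("Free cash flow in FY 2023?", "N/A"))
assert validate(Answer("Total assets in FY 2021?", "1,234,567.0"))
assert not validate(Answer("Total assets in FY 2021?", "unknown"))
```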

A complete list of questions and the original annual reports can be found in the Github repository.

We got 17 submissions in total, with some teams participating anonymously. Teams shared their architectures, LLM models and sometimes even more details.

Let’s review the table a bit:

Best solution - Checklist with GPT-4o

The highest scoring solution comes from Daniel Weller, a colleague from TIMETOACT GROUP Austria. It scored 84 out of a maximum of 100 points.

ℹ️ We have taken great care to ensure that all participants compete under the same conditions (please read the description on Github for more details) and to make the competition fair for everyone. For transparency, we explicitly mark affiliation with TIMETOACT in the TTA column.

In addition, some competitors also participate in the AI research program or benefit from its findings. These participants are marked in the AIR column for transparency.

Daniel has agreed to publish the source code for his solution. As soon as it is available, we will update the Github repository with the links. The status of the source code release can be seen in the source column.

Under the hood, Daniel’s solution uses the GPT-4o model with Structured Outputs. During the pre-fill phase it benefits from the fact that the possible question types were shared publicly with all participants, in the form of the question generator code. The solution prepares a checklist of the possible types of information to extract, enforces the data types with Structured Outputs, and runs it against all documents to extract the necessary information. Large documents are split based on their size.

During the question-answering phase, the solution goes through each question and passes it to GPT-4o together with the pre-filled checklist data. The resulting answer is shaped into the proper schema by using Structured Outputs again.
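The two phases can be sketched roughly as follows. This is a simplified pure-Python illustration with a stubbed extraction call; the actual solution lets GPT-4o with Structured Outputs do the extraction and enforce the schema:

```python
# Simplified sketch of the two-phase checklist pattern.
# Phase 1 (pre-fill): extract a fixed checklist of fields per document.
# Phase 2 (answering): answer each question from the pre-filled data.
# extract_field() stubs what would be a GPT-4o Structured Outputs call.

CHECKLIST = ["total_assets", "free_cash_flow", "revenue"]

def extract_field(document: str, field: str) -> str:
    """Stub for an LLM extraction call; returns a value or 'N/A'."""
    for line in document.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == field:
            return value.strip()
    return "N/A"

def prefill(documents: dict[str, str]) -> dict[str, dict[str, str]]:
    """Phase 1: run the whole checklist against every document."""
    return {name: {f: extract_field(doc, f) for f in CHECKLIST}
            for name, doc in documents.items()}

def answer(question_field: str, company: str, facts) -> str:
    """Phase 2: in the real solution, the question plus checklist data
    go to GPT-4o; here we simply look the value up."""
    return facts.get(company, {}).get(question_field, "N/A")

docs = {"ACME": "total_assets: 100\nrevenue: 40"}
facts = prefill(docs)
```

Note how a field missing from the document naturally comes back as “N/A”, which is exactly the refusal behaviour the trick questions test for.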

The solution was a bit on the expensive side. The information pre-fill for 20 PDFs cost almost 6 dollars, while answering 40 questions cost $2.44.

In this challenge we don’t place any limits on the cost of the solution, but encourage participants to capture and share cost data. Readers can then prioritise resulting solutions based on their own criteria.


Second Best - Classic RAG with GPT-4o

The second best solution came from Ilya Rice. It scored 76 points, achieved with GPT-4o and a classical Langchain-based RAG. It used one of the best embedding models, text-embedding-3-large from OpenAI, together with custom Chain-of-Thought prompts. The solution used fitz (PyMuPDF) for text parsing, while chunking texts by character count.
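Chunking by character count is straightforward to reproduce. A minimal sketch with overlapping chunks (the size and overlap values are illustrative, not Ilya’s actual settings):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so a
    fact that straddles a chunk boundary still appears intact in one
    of the neighbouring chunks."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # slide the window forward
    return chunks

# 2500 characters with a step of 800 yields 4 chunks
# starting at offsets 0, 800, 1600 and 2400.
parts = chunk_text("a" * 2500, size=1000, overlap=200)
```

Each chunk would then be embedded (e.g. with text-embedding-3-large) and stored in a vector index for retrieval.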


Third Best Solution - Checklists with Gemini Flash

The third best solution was provided by Artem Nurmukhametov. His solution was architecturally similar to Daniel’s, but used multi-stage processing for the checklists. It was driven by the Gemini Flash model from Google.

The solution was also on the expensive side, consuming $4 for the full challenge run.

As you may have noticed, two of the top three solutions used the Checklist pattern and Knowledge Mapping to benefit from the fact that the domain is known in advance. While this is a common situation in business (we can use Domain-Driven Design and iterative product development to capture a similar level of detail), it puts classical RAG systems at a disadvantage.

To compensate for that, in the next round of the Enterprise RAG Challenge we will rework the question generator to allow a lot more variability, making it prohibitively expensive to “cheat” by simply using Knowledge Mapping.


Best On-Premise Solution

As you may have noticed, most of the solutions used the GPT-4o LLM from OpenAI. According to our benchmarks, this is one of the best and most cost-effective LLMs currently available.

However, in the real world companies are sometimes interested in solutions that can run completely on premises. This can be desirable for various reasons: cost, IP protection or compliance.

Locality comes at a cost: local models like Llama are less capable than cloud-based models like OpenAI GPT-4 or Claude 3.5 Sonnet. To compensate, local AI systems start leveraging advanced techniques that are sometimes possible only with local models: precise guidance, fine-tuning (full fine-tuning, not the adapters that OpenAI employs), custom mixtures and ensembles of experts, or wide beam search.

It can be hard to compare the effective accuracy of drastically different approaches. The Enterprise RAG Challenge makes it possible to compare them on the same basis.

6th place is taken by a fully local system with a score of 69. The gap between this system and the winner is much smaller than we expected!

Under the hood, this system uses the Qwen-72B LLM, which is quite popular in some parts of Europe and Asia. The overall architecture is based on ReAct agent loops from LangChain with a RAG-driven query engine. Table data from the PDFs was converted to XML, and RecursiveCharacterTextSplitter was used for text chunking.
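Serialising extracted tables as XML keeps the row and column structure legible to the LLM inside plain-text chunks. A rough sketch of the idea (our own illustration, not the team’s actual code):

```python
# Sketch: serialise an extracted PDF table into XML so that the
# row/column structure survives text chunking and remains explicit
# to the LLM. Uses only the Python standard library.
import xml.etree.ElementTree as ET

def table_to_xml(headers: list[str], rows: list[list[str]]) -> str:
    """Render a header list plus row data as an XML string."""
    table = ET.Element("table")
    for row in rows:
        row_el = ET.SubElement(table, "row")
        for header, cell in zip(headers, row):
            cell_el = ET.SubElement(row_el, "cell", name=header)
            cell_el.text = cell
    return ET.tostring(table, encoding="unicode")

xml_str = table_to_xml(
    ["year", "total_assets"],
    [["2022", "1200"], ["2023", "1350"]],
)
```

Compared with flattening a table to whitespace-separated text, this makes it unambiguous which number belongs to which column, even after the document is split into chunks.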

The table contains two other solutions that can run fully on premises. These are marked with a ⭐ in the "Local" column.


Round 2 - This Fall

The first round was done within a small circle of peers, to test-drive and polish the experience. The reception was much better than we expected.

We are planning to host the next round of the Enterprise RAG Challenge later this fall. This round will be announced publicly and will include a few small balance changes:

  • The question generator will be rebalanced to produce fewer questions that result in an N/A answer. We’ll still keep a few around to catch hallucination cases.

  • We will generate more questions and ensure a greater diversity of possible questions. This will make the competition more challenging for approaches based on Knowledge Mapping and the Checklist LLM pattern.

All changes will be made public and shared as open source before the start of the competition. Every participant will be able to use that knowledge to prepare for that competition.

In addition, the source code of the solutions from TIMETOACT GROUP will be shared for everybody to benefit from.

We will also try to gather more data from the participants and make it more consistent.

All of that should make the results from the next round more valuable, helping to push our shared understanding of what it takes to build high-quality AI solutions for the enterprise in practice.


Strategic Outlook

We are heading into the end of the summer holidays and a new busy period for business. What can we expect from the coming months in the world of “LLMs for the Enterprise”?

First of all, architectural approaches for solving customer problems will continue to evolve. As we have seen from the RAG Challenge, there isn’t a single best option that clearly beats the rest. Radically different architectures are currently competing: solutions based on Knowledge Mapping, classical vector-based RAGs, and systems with dedicated agents and knowledge graphs.

By looking at the architecture alone, it is not possible to tell in advance whether it will be the best solution. The number of lines of code is not a clear indicator either.

This means there is still room to improve the quality of LLM-driven solutions on the architectural side alone.

However, LLM patterns and practices will not be the only factor driving future quality improvements. Let’s not forget that Large Language Models are continuously getting better and cheaper.

ℹ️ If you look at forum responses and online presence, ChatGPT and Anthropic’s Claude chat seem to keep getting worse, especially in the free tiers. However, people frequently forget that these are user-facing products that are used for field-testing new versions of Large Language Models.

Companies are motivated to make the LLMs running underneath as cheap as possible. And that is exactly what OpenAI has done in recent years.

For the most part, companies use fixed, stable models via the API. These models have a predictable quality and do not suddenly deteriorate.

Let’s look at the progression of “LLM performance you can get for your money” over time, using a chart based on the scores from our LLM Leaderboard.

In this chart we group models not by their marketing names, but by their provider and cost tier.

Here we can see an interesting pattern: for the same amount of money, we were able to get different accuracy at different points in time.

In the first half of 2023, companies started releasing good models, and everybody started leveraging them and talking about them. After grabbing a share of the market, companies switched into cost-saving mode, releasing new, less capable versions within the same tier. We wrote about that in multiple LLM Leaderboard reports.

Starting from 2024, when even Google joined the AI race, companies started working on model quality again. They are releasing new models that work better for the same amount of money.

The progress looks quite steady so far, and it repeats across multiple LLM vendors. This makes us believe that LLMs will continue improving their “bang-for-buck” ratio over the next 6 months as well.

What does this mean? It is a good time to be building LLM-driven systems that help businesses create more value. They already work nicely, and they will continue getting even better, both through architectural improvements and through releases of more capable LLMs.

We’ll continue tracking both perspectives in our monthly LLM Leaderboard.

LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!
