LLM Benchmarks 

February 2025

We focused on the Enterprise RAG Challenge in the past months, so this benchmark hasn't seen as much progress as we expected. The work on an interactive table and more test cases will happen in the next releases.

Still, we have quite a few topics to cover:

  • AI coding tests imported into benchmark

  • OpenAI: o3-mini and GPT-4.5

  • Anthropic: Claude 3.7 with reasoning and without

  • Qwen: QwQ 32B, Qwen Max, Qwen Plus

  • Crisis of OpenAI SDK as a common standard for LLM APIs

  • Insights from the Enterprise RAG Challenge

Let’s get started!

Here is how the benchmark looks at this moment. It tests the ability of LLMs to handle business and enterprise tasks that we have extracted from industry cases. It presents models with challenging tasks, but also allows them to benefit from reasoning (by running custom chain-of-thought routines within the prompt).

At the moment OpenAI is at the top, closely followed by Anthropic and DeepSeek models. It is amazing to see locally-capable models within the TOP-5.

It is even more amazing to see a 70B model in 5th place. If the rumours about the NVIDIA RTX PRO 6000 Blackwell are true, one would need only a single GPU like that to run this model at native (to the GPU) fp8 quantisation, while still having some spare room for KV caches and larger parallel contexts.

New AI Coding Tasks imported into the benchmark

As we've mentioned earlier, this benchmark is not complete yet. We are still in the process of reviewing AI cases in our portfolio and importing them as new tests into the benchmark.

Here is an example from a fintech code comprehension benchmark. This test (one of many) reads like this:

Which specs would fail, if I add a new feature: after authorizing any transaction larger than 3000, the system automatically blocks the card due to “Large Transaction Risk.” We do not add new event type, just extend the existing command handler. 

Source code: %SOURCE CODE%

By the way, note that we put the question before the contextual data in the prompt. This completely spoils the caching semantics, but makes the job easier for the models and improves the overall accuracy.

The source code for this specific task is a class implementing credit card state management. It comes with a list of executable event-driven specifications that verify all the behaviours. The entire file is 1100 lines of dense Python code.
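To make the setup concrete, here is roughly how such a prompt gets assembled, with the question in front of the large source file (a simplified sketch, not our exact harness; the file name is hypothetical):

def build_prompt(question: str, source_code: str) -> str:
    # The question goes first and the large source file last.
    # This spoils prompt caching on the shared source prefix,
    # but makes the job easier for the models and improves accuracy.
    return f"{question}\n\nSource code: {source_code}"

# hypothetical file name; the real task ships its own 1100-line module
with open("credit_card_aggregate.py") as f:
    prompt = build_prompt(
        "Which specs would fail, if I add a new feature: ...",
        f.read(),
    )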

All models are given a custom chain-of-thought routine that they can leverage within the prompt response. Here is, for example, how o1 correctly handles this tricky question:

{
  "short_thought_steps": [
    "Check specs that authorizeTransaction with amounts > 3000",
    "No spec has an authorization > 3000",
    "Hence no test scenario triggers the new block logic",
    "No spec fails"
  ],
  "failing_spec_names": []
}

And here is how the request is handled by Sonnet 3.7 (without reasoning):

{
    "short_thought_steps": [
        "1. The new feature will automatically block a card when a transaction over 3000 is authorized",
        "2. This means the authorize_transaction method will now have a side effect of blocking the card",
        "3. Need to check all specs that authorize transactions over 3000",
        "4. Also need to check specs that expect certain events after authorization",
        "5. The spec 'authorize_transaction_success' authorizes a transaction of 2000, which is below the threshold",
        "6. No other specs explicitly authorize transactions over 3000",
        "7. However, any spec that expects only a TransactionAuthorized event after authorizing a large transaction would fail",
        "8. Let me create a test case in my mind: if we authorize a 4000 transaction, we'd expect both TransactionAuthorized and CardBlocked events",
        "9. No existing specs test this scenario directly"
    ],
    "failing_spec_names": []
}
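Both answers follow the same response schema. Here is a minimal sketch of how such a custom chain-of-thought schema can be declared (we use Pydantic for illustration; the field names are taken from the responses above, the rest is our assumption):

from pydantic import BaseModel, Field

class CodeComprehensionAnswer(BaseModel):
    # A few short reasoning steps the model writes down
    # before committing to the final answer.
    short_thought_steps: list[str] = Field(default_factory=list)
    # Names of the executable specs that would fail; empty if none.
    failing_spec_names: list[str] = Field(default_factory=list)

Where the provider supports Structured Outputs, a schema like this can be enforced by constrained decoding; otherwise the model is only prompted to follow it.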

Coding tasks like this are a good proxy for understanding the capabilities of models in software engineering. We will continue importing our AI cases into this benchmark to make it more relevant and representative.

Now, let’s get to benchmarking actual models.

OpenAI: o3-mini and GPT-4.5

The o3-mini model from OpenAI took first place in our benchmark (reasoning effort: medium). It is currently the best model available. It scored a perfect 100% on our current coding benchmark, 45% on the BI benchmark and 70% on the compliance benchmark.
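For reference, the reasoning effort is just a request parameter; a minimal sketch of such a call (assuming the OpenAI Python SDK):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",  # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "..."}],
)
print(response.choices[0].message.content)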

The recently released GPT-4.5 model doesn't have reasoning. It is very thoughtful and capable of understanding nuances, but it didn't score that high, ending with a 51% aggregate score.

Here is, for example, how this model (incorrectly) handled the coding task from above. It just didn't think things through to the end:

{
  "short_thought_steps": [
    "Identify specs involving transaction authorization.",
    "Check if transaction amounts exceed 3000.",
    "List specs that would fail due to the new blocking rule."
  ],
  "failing_spec_names": [
    "authorize_transaction_success"
  ]
}

Anthropic: Claude 3.7 Sonnet with reasoning and without

Anthropic models are usually praised by developers and users for their pleasant and productive conversation style. However, Anthropic hadn't really been successful with advanced models (anybody remember Opus?) until recently.

The new Claude 3.7 Sonnet with thinking scored 70% on our benchmark (and a perfect 100% coding score), taking 2nd place. Its sibling model (without reasoning) took 8th place.

We ran Claude 3.7 Sonnet in reasoning mode with a budget of 25k tokens, 80% of which was dedicated to reasoning. It also had a few reasoning slots within the answer budget.
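In terms of the Anthropic API, that configuration looks roughly like this (a sketch; the model id and exact limits should be checked against the current docs):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",   # model id may differ
    max_tokens=25_000,                    # total budget for the response
    thinking={"type": "enabled", "budget_tokens": 20_000},  # ~80% for reasoning
    messages=[{"role": "user", "content": "..."}],
)

# Thinking and the final answer arrive as separate content blocks;
# very long requests like this may require streaming in practice.
answer = "".join(b.text for b in response.content if b.type == "text")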

What makes the victory sweeter for Anthropic: their models still don't have a proper Structured Output mode (where the schema is guaranteed in 100% of cases by constrained decoding), yet they manage to respond with correctly parseable JSON in every single case of this benchmark.

Note that your mileage with Anthropic may vary. For example, in some production cases we are seeing JSON schema error rates around 5-10%, which usually requires running a schema repair routine.
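Such a repair routine can be as simple as feeding the broken output and the validation error back to the model; a minimal sketch (the call_model helper below is hypothetical):

import json
from pydantic import BaseModel, ValidationError

def parse_with_repair(raw: str, schema: type[BaseModel], call_model, retries: int = 2):
    """Try to parse the model output; on failure, ask the model to fix it."""
    for _ in range(retries + 1):
        try:
            return schema.model_validate_json(raw)
        except (ValidationError, json.JSONDecodeError) as err:
            raw = call_model(
                "Fix this JSON so that it matches the schema. Return only JSON.\n"
                f"Schema: {json.dumps(schema.model_json_schema())}\n"
                f"Error: {err}\n"
                f"JSON: {raw}"
            )
    raise ValueError("could not repair the model output")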

Qwen Models

We have a bunch of new Qwen models in our benchmark, but will focus only on the two outstanding ones: Qwen Max and QwQ-32B.

Qwen Max is a large-scale MoE model. It has decent scores, placing it in 12th place, comparable with Microsoft Phi-4 and Claude 3.5 Sonnet.

Qwen QwQ-32B is a locally-capable model that pushes the state of the art in our benchmark. It raises the bar for the accuracy achievable by 32B models. The previous record was also held by a Qwen model (Qwen2.5-32B Instruct), but the new model is a proper reasoning model.

We tested the QwQ-32B model by running it via OpenRouter with Fireworks as the provider (selected because they support Structured Outputs). This experience cost us a few days of frustration and highlighted a general problem with the ecosystem of reasoning models.

Crisis of OpenAI SDK as a common standard for LLM APIs

At the time of writing, the OpenAI API and Python SDK are the de-facto standard for interfacing with different models. It isn't the best standard, but nothing else has stuck. Within the ecosystem you can use one library to access various servers: vLLM, ollama and others. NVIDIA provides an OpenAI-compatible frontend for the Triton Inference Server. Even Google, after long deliberation, said that it is going to be OpenAI-compatible.
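The convenience is real: pointing the same client at a different base_url is usually all it takes. A small sketch (the endpoints shown are the publicly documented defaults; the model slug may differ):

from openai import OpenAI

# Same client class, different backends: only base_url and key change.
openrouter = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
local_vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
local_ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = openrouter.chat.completions.create(
    model="qwen/qwq-32b",  # OpenRouter model slug
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)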

Things work most of the time, until they don’t.

For instance, when calling the QwQ-32B model through the OpenAI SDK (a combination OpenRouter says is supported), we were getting an Error: 'NoneType' object is not iterable from deep inside the OpenAI SDK.

The root cause was a bit complex.

Modern reasoning models spend extra tokens to think through a problem before providing an answer. OpenAI reasoning models hide this thinking context, while open-source models make it explicit by annotating it with <think> tags in the response.
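For the open models, separating the two parts is a small string-handling exercise; a minimal sketch:

import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate <think>...</think> blocks from the final answer."""
    thinking = "\n".join(re.findall(r"<think>(.*?)</think>", text, flags=re.S))
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.S).strip()
    return thinking, answer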

Now, OpenRouter probably tried to be smart and convenient. It automatically puts the think tokens into a special "reasoning" field of the response, while putting the actual answer into "content". That alone is enough to mess things up in cases where the model runs with Structured Output constrained decoding. Note that here the reasoning includes a proper JSON answer, while the content is empty:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "",
        "reasoning": "{ \"chain_of_thought\": [ \"To determine how many ...",
        "refusal": null,
        "role": "assistant"
      },
      "native_finish_reason": "stop"
    }
  ],

It took some time to identify the root cause and implement a custom client (fixing problems like that on the fly), only to hit another edge case.

In a few percent of cases, Qwen QwQ-32B, when queried through OpenRouter from Fireworks, would return a response like this one, where the thinking output is plain text and the final content is also plain text, without any Structured Output in sight:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "```json\n{\n  \"short_thought_steps...```",
        "reasoning": "Okay, let me figure...",
        "refusal": null,
        "role": "assistant"
      },
      "native_finish_reason": "stop"
    }
  ],

Long story short, the ecosystem around standardised querying of reasoning models is currently a mess. There are too many gaps that leave interpretation open to the implementing parties. This leads to weird edge cases.

Coincidentally, OpenAI has just introduced its new Responses API. It is designed to replace the Chat Completions API and could perhaps bring a better standard for working with reasoning LLMs.

Insights from the Enterprise RAG Challenge round 2 (ERCr2)

As you may have heard, we have just finished running round 2 of our Enterprise RAG Challenge. We had more than 350 registrations before the competition. The challenge involved building a RAG system to answer 100 questions about 100 different annual reports of companies. The largest report had more than 1000 pages!

We received a lot of submissions and more than 100 completed AI surveys, in which teams explained their approaches and architectures in greater detail. This made ERCr2 a big crowd-sourced research experiment on building precise Enterprise RAGs.

Here are some of the early insights that we have discovered together with the community.

First of all, the quality of data extraction is important for overall accuracy. The winning RAG solutions used the Docling document extraction library.

The second insight: with a good architecture, you can get high-quality results even when using a tiny local LLM.

Take a look at the winning architecture: PDF parsing with a heavily modified Docling library + dense retrieval + router + parent document retrieval + SO CoT (Structured Output chain of thought) + SO reparser.

When evaluated with different LLMs, it obtained the following scores:

o3-mini        R: 83.8 │ G: 81.8 
llama3.3-70b   R: 83.9 │ G: 72.8 
llama-3.1 8b   R: 81.1 │ G: 68.7 

R - Retrieval score
G - Generation score

As you can see, the less capable the model, the lower the scores. But given the high quality of the retrieval, the generation score doesn't drop as fast as we would have expected!

In other words, answer accuracy would be 68.7% using a good overall architecture and llama-3.1 8B. If we replaced it with the 70B model, accuracy would increase by only about 4 points. And if we replaced the local 70B model with the best reasoning model, Enterprise RAG accuracy improves by only about 9 points.

This scoring was extracted from our benchmarks and applied to the Enterprise RAG Challenge. It turned out to be a good proxy for the potential quality that can be achieved with a given RAG architecture. RAG accuracy depends on the quality of the retrieval component, but if the final RAG score is far lower than the retrieval quality, there is some low-hanging fruit that can be picked to improve precision.

Our third insight: reasoning patterns on top of Structured Outputs and custom chains of thought were used by all top solutions. They work even in cases where Structured Outputs are not supported by the LLM serving layer (via constrained decoding) and the schema must be repaired first.

Our fourth insight: vector search is not completely dead yet. A high-quality dense retrieval mechanism powered the best solution, although the second-best solution (on the prize leaderboard) didn't use vectors.

But in general, one thing we've learned is that small, locally-capable models can achieve a lot more than we initially thought. Open-source LLMs are already good competition for proprietary models, especially if you use SO/CoT patterns to push them to their limits.

Perhaps in 2025 we will see wider adoption of AI business systems that run locally and no longer compromise quality for safety.
