The best language models for digital products

LLM Benchmarks | May 2024

Based on real benchmark data from our own software products, we evaluated how well different LLMs handle specific challenges. We examined categories such as document processing, CRM integration, external integration, marketing support, and code generation.


Our benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama2 license

A more detailed explanation of the respective categories can be found below the table.

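Model | Code | CRM | Docs | Integrate | Marketing | Reason | Final 🏆 | Cost | Speed
GPT-4o ☁️ | 85 | 95 | 100 | 90 | 82 | 75 | 88 | 1.24 € | 1.49 rps
GPT-4 Turbo v5/2024-04-09 ☁️ | 80 | 99 | 98 | 93 | 88 | 45 | 84 | 2.51 € | 0.83 rps
GPT-4 v1/0314 ☁️ | 80 | 88 | 98 | 52 | 88 | 50 | 76 | 7.19 € | 1.26 rps
GPT-4 Turbo v4/0125-preview ☁️ | 60 | 97 | 100 | 71 | 75 | 45 | 75 | 2.51 € | 0.82 rps
GPT-4 v2/0613 ☁️ | 80 | 83 | 95 | 52 | 88 | 50 | 74 | 7.19 € | 2.07 rps
Claude 3 Opus ☁️ | 64 | 88 | 100 | 53 | 76 | 59 | 73 | 4.83 € | 0.41 rps
GPT-4 Turbo v3/1106-preview ☁️ | 60 | 75 | 98 | 52 | 88 | 62 | 72 | 2.52 € | 0.68 rps
Gemini Pro 1.5 0514 ☁️ | 67 | 96 | 75 | 100 | 25 | 62 | 71 | 2.06 € | 0.91 rps
Gemini Pro 1.5 0409 ☁️ | 62 | 97 | 96 | 63 | 75 | 28 | 70 | 1.89 € | 0.58 rps
GPT-3.5 v2/0613 ☁️ | 62 | 79 | 73 | 75 | 81 | 48 | 70 | 0.35 € | 1.39 rps
GPT-3.5 v3/1106 ☁️ | 62 | 68 | 71 | 63 | 78 | 59 | 67 | 0.24 € | 2.29 rps
Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 56 | 72 | 41 | 66 | 0.10 € | 1.76 rps
GPT-3.5 v4/0125 ☁️ | 58 | 85 | 71 | 60 | 78 | 47 | 66 | 0.13 € | 1.41 rps
Gemini Pro 1.0 ☁️ | 55 | 86 | 83 | 60 | 88 | 26 | 66 | 0.10 € | 1.35 rps
Cohere Command R+ ☁️ | 58 | 77 | 76 | 49 | 70 | 59 | 65 | 0.85 € | 1.88 rps
Qwen1.5 32B Chat f16 ⚠️ | 64 | 87 | 82 | 56 | 78 | 15 | 64 | 1.02 € | 1.61 rps
GPT-3.5-instruct 0914 ☁️ | 44 | 90 | 69 | 60 | 88 | 32 | 64 | 0.36 € | 2.12 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 56 | 86 | 67 | 52 | 88 | 26 | 62 | 0.37 € | 2.99 rps
Meta Llama 3 8B Instruct f16 🦙 | 74 | 60 | 68 | 49 | 80 | 42 | 62 | 0.35 € | 3.16 rps
GPT-3.5 v1/0301 ☁️ | 49 | 75 | 69 | 67 | 82 | 24 | 61 | 0.36 € | 3.93 rps
Starling 7B-alpha f16 ⚠️ | 51 | 66 | 67 | 52 | 88 | 36 | 60 | 0.61 € | 1.80 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅ | 46 | 72 | 72 | 49 | 88 | 31 | 60 | 0.51 € | 2.14 rps
Claude 3 Haiku ☁️ | 59 | 69 | 64 | 55 | 75 | 33 | 59 | 0.08 € | 0.53 rps
Mixtral 8x22B API (Instruct) ☁️ | 47 | 62 | 62 | 94 | 75 | 7 | 58 | 0.18 € | 3.01 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 51 | 74 | 72 | 41 | 75 | 31 | 57 | 0.36 € | 3.05 rps
Claude 3 Sonnet ☁️ | 67 | 41 | 74 | 52 | 78 | 30 | 57 | 0.97 € | 0.85 rps
Mistral Large v1/2402 ☁️ | 33 | 49 | 70 | 75 | 84 | 25 | 56 | 2.19 € | 2.04 rps
Anthropic Claude Instant v1.2 ☁️ | 51 | 75 | 65 | 59 | 65 | 14 | 55 | 2.15 € | 1.47 rps
Anthropic Claude v2.0 ☁️ | 57 | 52 | 55 | 45 | 84 | 35 | 55 | 2.24 € | 0.40 rps
Cohere Command R ☁️ | 39 | 63 | 57 | 55 | 84 | 26 | 54 | 0.13 € | 2.47 rps
Qwen1.5 7B Chat f16 ⚠️ | 51 | 81 | 60 | 34 | 60 | 36 | 54 | 0.30 € | 3.62 rps
Anthropic Claude v2.1 ☁️ | 36 | 58 | 59 | 60 | 75 | 33 | 53 | 2.31 € | 0.35 rps
Qwen1.5 14B Chat f16 ⚠️ | 44 | 58 | 51 | 49 | 84 | 17 | 51 | 0.38 € | 2.90 rps
Meta Llama 3 70B Instruct b8 🦙 | 46 | 72 | 53 | 29 | 82 | 18 | 50 | 7.32 € | 0.22 rps
Mistral 7B OpenOrca f16 ☁️ | 42 | 57 | 76 | 21 | 78 | 26 | 50 | 0.43 € | 2.55 rps
Mistral 7B Instruct v0.1 f16 ☁️ | 31 | 70 | 69 | 44 | 62 | 21 | 50 | 0.79 € | 1.39 rps
Llama2 13B Vicuna-1.5 f16 🦙 | 36 | 37 | 53 | 39 | 82 | 38 | 48 | 1.02 € | 1.07 rps
Llama2 13B Hermes f16 🦙 | 38 | 23 | 30 | 61 | 60 | 43 | 42 | 1.03 € | 1.06 rps
Llama2 13B Hermes b8 🦙 | 32 | 24 | 29 | 61 | 60 | 43 | 42 | 4.94 € | 0.22 rps
Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 58 | 65 | 51 | 56 | 8 | 41 | 0.19 € | 2.17 rps
Mistral Small v2/2402 ☁️ | 27 | 35 | 36 | 82 | 56 | 8 | 41 | 0.19 € | 3.14 rps
IBM Granite 34B Code Instruct f16 ☁️ | 52 | 49 | 30 | 44 | 57 | 5 | 40 | 1.12 € | 1.46 rps
Llama2 13B Puffin f16 🦙 | 37 | 12 | 38 | 48 | 56 | 41 | 39 | 4.89 € | 0.22 rps
Mistral Medium v1/2312 ☁️ | 36 | 30 | 27 | 59 | 62 | 12 | 38 | 0.83 € | 0.35 rps
Llama2 13B Puffin b8 🦙 | 37 | 9 | 37 | 46 | 56 | 39 | 37 | 8.65 € | 0.13 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 13 | 39 | 57 | 40 | 59 | 8 | 36 | 0.05 € | 2.30 rps
Llama2 13B chat f16 🦙 | 15 | 38 | 17 | 45 | 75 | 8 | 33 | 0.76 € | 1.43 rps
Llama2 13B chat b8 🦙 | 15 | 38 | 15 | 45 | 75 | 6 | 32 | 3.35 € | 0.33 rps
Mistral 7B Zephyr-β f16 ✅ | 28 | 34 | 46 | 44 | 29 | 4 | 31 | 0.51 € | 2.14 rps
Llama2 7B chat f16 🦙 | 20 | 33 | 20 | 42 | 50 | 20 | 31 | 0.59 € | 1.86 rps
Mistral 7B Notus-v1 f16 ⚠️ | 16 | 43 | 25 | 41 | 48 | 4 | 30 | 0.80 € | 1.37 rps
Orca 2 13B f16 ⚠️ | 15 | 22 | 32 | 22 | 67 | 19 | 29 | 0.99 € | 1.11 rps
Microsoft Phi 3 Mini 4K Instruct f16 ⚠️ | 36 | 24 | 26 | 17 | 50 | 8 | 27 | 0.95 € | 1.15 rps
Mistral 7B Instruct v0.2 f16 ☁️ | 7 | 21 | 50 | 13 | 58 | 8 | 26 | 1.00 € | 1.10 rps
Mistral 7B f16 ☁️ | 0 | 4 | 42 | 42 | 52 | 12 | 25 | 0.93 € | 1.17 rps
Orca 2 7B f16 ⚠️ | 13 | 0 | 24 | 18 | 52 | 4 | 19 | 0.81 € | 1.34 rps
Llama2 7B f16 🦙 | 0 | 2 | 18 | 3 | 28 | 2 | 9 | 1.01 € | 1.08 rps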
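
The article does not say how the final score is derived from the category scores. One reading that matches the published rows to within a point is an unweighted mean of the six categories. The sketch below checks that reading against the GPT-4o row; the function name `final_score` and the averaging rule itself are assumptions, not something the benchmark authors confirm.

```python
from statistics import mean

# Category scores copied from the GPT-4o row of the table above.
gpt4o = {
    "code": 85,
    "crm": 95,
    "docs": 100,
    "integrate": 90,
    "marketing": 82,
    "reason": 75,
}


def final_score(scores: dict[str, int]) -> int:
    """Unweighted mean of the category scores, rounded to the nearest integer.

    ASSUMPTION: the real aggregation is not documented in the article.
    """
    return round(mean(scores.values()))


print(final_score(gpt4o))  # 88, the same as the published final score
```

Some rows differ from this mean by one point, which would be consistent with the published category scores already being rounded.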

The benchmark categories in detail

Here's exactly what we're looking at in the different categories of the LLM leaderboard:

Docs: How well can the model work with large documents and knowledge bases?

CRM: How well does the model support work with product catalogs and marketplaces?

Integrate: Can the model easily interact with external APIs, services and plugins?

Marketing: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

Reason: How well can the model reason and draw conclusions in a given context?

Code: Can the model generate code and help with programming?

Cost: The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on the GPU requirements of each model, GPU rental cost, model speed, and operational overhead.

Speed: The estimated speed of the model in requests per second (without batching). The higher the speed, the better.
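
To make the on-premises cost estimate more concrete, here is a minimal sketch of how GPU rental cost, model speed and overhead could be combined. The exact formula used for the benchmark is not published, so the functions `cost_per_request` and `workload_cost`, the multiplicative `overhead_factor`, and all numbers in the example are hypothetical.

```python
def cost_per_request(
    gpu_hourly_eur: float,
    gpu_count: int,
    requests_per_second: float,
    overhead_factor: float = 1.0,
) -> float:
    """Estimated cost in EUR of serving one request without batching.

    ASSUMPTION: overhead is modelled as a simple multiplier on the rental cost.
    """
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    hourly_cost = gpu_hourly_eur * gpu_count * overhead_factor
    requests_per_hour = requests_per_second * 3600
    return hourly_cost / requests_per_hour


def workload_cost(n_requests: int, **kwargs: float) -> float:
    """Estimated cost in EUR of running a workload of n_requests requests."""
    return n_requests * cost_per_request(**kwargs)


# Hypothetical example: one GPU rented at 2.00 EUR/h, 2 requests per second,
# 20% operational overhead, 10,000 requests in the workload.
estimate = workload_cost(
    10_000,
    gpu_hourly_eur=2.00,
    gpu_count=1,
    requests_per_second=2.0,
    overhead_factor=1.2,
)
print(f"{estimate:.2f} EUR")
```

Under this model a slower model costs more per workload on identical hardware, which fits the table: the b8 variants of the Llama models are both slower and more expensive than their f16 counterparts.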

Deeper insights

Google Gemini 1.5 - Pro and Flash

The recent Google I/O announcement was all about AI and Gemini Pro. Although Google managed to forget about Gemini Ultra (it was supposed to come out early this year), they did update a few models.


Gemini Pro 1.5 0514 trades off some document comprehension capabilities in favour of better reasoning. Overall, it comes out slightly ahead of the previous Gemini Pro 1.5 version.

In our experience it was also a bit buggy at this point. We’ve encountered a bunch of server errors and even managed to get a “HARM_CATEGORY_DANGEROUS_CONTENT” flag on one of the business benchmarks.


Gemini Pro 1.5 achieved a perfect "Integrate" score, where we measure an LLM's ability to reliably follow instructions and work with external systems, plugins and data formats.


Gemini 1.5 Flash is an interesting new addition to the family. It works well in document tasks and has decent reasoning capability. Combine that with a very low price, and you get a good alternative to GPT-3.5.

Just note Google Gemini's peculiar pricing model - they don't charge for text by tokens, but by billable symbols (Unicode code points minus spaces). We have seen several developers and even SaaS systems make mistakes when estimating costs.
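
As a rough illustration of that counting rule, the sketch below counts Unicode code points and skips spaces. The function name `billable_characters` is hypothetical, and treating every whitespace character (not only the plain space) as non-billable is an assumption; check the current Google pricing documentation before relying on it for real estimates.

```python
def billable_characters(text: str) -> int:
    """Count Unicode code points in text, excluding whitespace.

    Python strings are sequences of code points, so iterating over the
    string visits one code point at a time.
    ASSUMPTION: all whitespace (spaces, tabs, newlines) is non-billable.
    """
    return sum(1 for ch in text if not ch.isspace())


prompt = "Hello world"
print(len(prompt))                  # 11 code points including the space
print(billable_characters(prompt))  # 10 billable characters
```

The count depends only on the characters, not on how a tokenizer would split them, so token-based cost estimates do not carry over to this pricing model.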

GPT-4o - Clearly in the lead, but with a caveat

GPT-4o looks perfect at first glance. It is faster and cheaper than GPT-4 Turbo. It also has a 128K context window, scores higher, has native multi-modality and understands languages better.

Under the hood it has a new tokeniser with a bigger dictionary, which leads to reduced token counts.

Overall, the model scores didn’t jump much higher, because we are already operating at the limit of our LLM Benchmark. There is one exception: our “Reason” category (the ability of models to handle complex logical and reasoning tasks) was made inherently difficult, so it still leaves room for improvement. Here GPT-4o managed to raise the score from 62 (GPT-4 Turbo v3/1106-preview) to 75.

What is the caveat?

You see, OpenAI seems to operate in cycles. They switch between: “let’s make a better model” and “let’s make a cheaper model without sacrificing quality too much”.

While the LLM Benchmarks don’t catch it, it feels like GPT-4o belongs to the cost-reduction models. It works amazingly well on small prompts. However, other benchmarks demonstrate that it is not as good as the other GPT-4 models at dealing with larger contexts. It also feels lacking in reasoning, even though current benchmarks are not capable of catching this regression.


For the time being, GPT-4 Turbo v5/2024-04-09 is our recommended go-to model.

Qwen 1.5 Chat


Due to the high demand and good ratings in the LMSYS Arena, we decided to benchmark some flavours of Alibaba Cloud's Qwen Chat model.


Qwen 1.5 32B Chat is quite good. Its final score is within the range of the GPT-3.5 models and Gemini Pro 1.0. It comes with a non-standard license, though.

We have also tested Qwen 1.5 7B and 14B - they are quite good for their relative size. Nothing peculiar, just a decent performance.


The license of Qwen1.5 Chat is a Chinese equivalent of the Llama 3 license: you can use the model freely for commercial purposes if you have fewer than 100M monthly active users (MAU). This might make model adoption tricky in the USA and EU.

IBM Granite 34B Code Instruct

The one notable thing about this model is how it was released: while the previous versions of the IBM Granite models were available only within the IBM Cloud, the model under test was published directly on Hugging Face.

Long story short, IBM Granite 34B Code Instruct has a decent code capability (for a 7B model) and bad results in pretty much everything else. If you need a local coding model at a fraction of compute cost, just pick Llama3 or one of its derivatives.

LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!

