The best language models for digital products

Benchmarks for ChatGPT and Co

Based on real benchmark data from our own software products, we evaluated the performance of different LLM models in addressing specific challenges. We examined specific categories such as document processing, CRM integration, external integration, marketing support, and code generation.  

Highlights:

LLM Benchmarks | April 2024

Our benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama2 license

A more detailed explanation of the respective categories can be found below the table.

modelcodecrmdocsintegratemarketingreasonfinal 🏆CostSpeed

GPT-4 Turbo v5/2024-04-09 ☁️

809998938845842.51 €0.83 rps
GPT-4 v1/0314 ☁️808898528850767.19 €1.26 rps
GPT-4 Turbo v4/0125-preview ☁️6097100717545752.51 €0.82 rps
GPT-4 v2/0613 ☁️808395528850747.19 €2.07 rps
Claude 3 Opus ☁️6488100537659734.83 €0.41 rps
GPT-4 Turbo v3/1106-preview ☁️607598528862722.52 €0.68 rps
Gemini Pro 1.5 ☁️629796637528701.89 €0.58 rps
GPT-3.5 v2/0613 ☁️627973758148700.35 €1.39 rps
GPT-3.5 v3/1106 ☁️626871637859670.24 €2.29 rps
GPT-3.5 v4/0125 ☁️588571607847660.13 €1.41 rps
Gemini Pro 1.0 ☁️558683608826660.10 €1.35 rps
Cohere Command R+ ☁️587776497059650.85 €1.88 rps
GPT-3.5-instruct 0914 ☁️449069608832640.36 €2.12 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅568667528826620.37 €2.99 rps
Meta Llama 3 8B Instruct f16🦙746068498042620.35 €3.16 rps
GPT-3.5 v1/0301 ☁️497569678224610.36 €3.93 rps
Starling 7B-alpha f16 ⚠️516667528836600.61 €1.80 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅467272498831600.51 €2.14 rps
Claude 3 Haiku ☁️596964557533590.08 €0.53 rps
Mixtral 8x22B API (Instruct) ☁️47626294757580.18 €3.01 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅517472417531570.36 €3.05 rps
Claude 3 Sonnet ☁️674174527830570.97 €0.85 rps
Mistral Large v1/2402 ☁️334970758425562.19 €2.04 rps
Anthropic Claude Instant v1.2 ☁️517565596514552.15 €1.47 rps
Anthropic Claude v2.0 ☁️575255458435552.24 €0.40 rps
Cohere Command R ☁️396357558426540.13 €2.47 rps
Anthropic Claude v2.1 ☁️365859607533532.31 €0.35 rps
Meta Llama 3 70B Instruct b8🦙467253298218507.32 €0.22 rps
Mistral 7B OpenOrca f16 ☁️425776217826500.43 €2.55 rps
Mistral 7B Instruct v0.1 f16 ☁️317069446221500.79 €1.39 rps
Llama2 13B Vicuna-1.5 f16🦙363753398238481.02 €1.07 rps
Llama2 13B Hermes f16🦙382330616043421.03 €1.06 rps
Llama2 13B Hermes b8🦙322429616043424.94 €0.22 rps
Mistral Small v1/2312 (Mixtral) ☁️10586551568410.19 €2.17 rps
Mistral Small v2/2402 ☁️27353682568410.19 €3.14 rps
Llama2 13B Puffin f16🦙371238485641394.89 €0.22 rps
Mistral Medium v1/2312 ☁️363027596212380.83 €0.35 rps
Llama2 13B Puffin b8🦙37937465639378.65 €0.13 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️13395740598360.05 €2.30 rps
Llama2 13B chat f16🦙15381745758330.76 €1.43 rps
Llama2 13B chat b8🦙15381545756323.35 €0.33 rps
Mistral 7B Zephyr-β f16 ✅28344644294310.51 €2.14 rps
Llama2 7B chat f16🦙203320425020310.59 €1.86 rps
Mistral 7B Notus-v1 f16 ⚠️16432541484300.80 €1.37 rps
Orca 2 13B f16 ⚠️152232226719290.99 €1.11 rps
Mistral 7B Instruct v0.2 f16 ☁️7215013588261.00 €1.10 rps
Mistral 7B f16 ☁️0442425212250.93 €1.17 rps
Orca 2 7B f16 ⚠️1302418524190.81 €1.34 rps
Llama2 7B f16🦙0218328291.01 €1.08 rps

The benchmark categories in detail

Here's exactly what we're looking at with the different categories of LLM Leaderboards

How well can the model work with large documents and knowledge bases?

How well does the model support work with product catalogs and marketplaces?

Can the model easily interact with external APIs, services and plugins?

How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

How well can the model reason and draw conclusions in a given context?

Can the model generate code and help with programming?

The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead.

The "Speed" column indicates the estimated speed of the model in requests per second (without batching). The higher the speed, the better.


Deeper insights

Google Gemini Pro 1.5

The newer version, Gemini 1.5 Pro shows significantly better performance than Gemini 1.0 Pro in the last month. It almost reaches the performance of the GPT-4 Turbo.

This model performs particularly well on tasks related to working on documents and information. It also performs almost perfectly on CRM-related tasks. However, complex reasoning tasks are below the level of GPT-3.5.

Gemini Pro 1.5 is about 20 times more expensive than Pro 1.0 on our workloads, which is to be expected given the quality level of GPT-4.

Both models are now available in Google Vertex AI, finally making them usable for enterprise customers in the EU.

Command R models from Cohere

Cohere AI specializes in enterprise-oriented LLMs. They have the Command R family of models - LLMs designed for document-oriented tasks: "Command R" and "Command R Plus".

These models are available both as API SaaS and as downloadable models on Hugging Face. Downloadable models are published for non-commercial purposes.

The Command-R model is roughly comparable to the Anthropic Claude models of the first two generations, but significantly cheaper. Nevertheless, there are better models in this price category, such as Gemini Pro 1.0 and Claude 3 Haiku.

The Command R+ is a significantly better model with capabilities in the GPT-3.5 range, but at two to three times the price.

OpenAI reaches another milestone with new ChatGPT-4 Turbo

OpenAI has released the new GPT-4 Turbo model with the version number 2023-04-09. This is outstanding for two reasons.

  • First, OpenAI has finally used sensible version numbers. It only took one year of progress.

  • Second, this model beats all the other models on our LLM benchmarks. It takes the top place with a substantial gap to the second place.

This score jump comes from nearly perfect stores in CRM and Docs categories. Plus OpenAI has finally fixed the instruction following problem with few-shots, that was causing Integrate category to be so low.

GPT-4 Turbo 2023-04-09 is currently our default recommendation for the new LLM-driven projects that need the best performing LLM to get started.

Llama 3 70B and 8B

Meta has just released new models in its third generation. We have tested instruct versions of 70B and 8B for the usability in LLM-driven products.

Llama 3 70B had a bumpy start - the upload to HuggingFace had bugs with tokens in chat template processing. Once these were fixed, the model started working better, on the level of old generations of Anthropic Claude v2.

Note that we tested the b8 quantized model to properly fit 2xA100 80GB SMX cards. There is a possibility that f16 might give slightly better results.

Llama 3 8B Instruct performed much better on the benchmarks, pushing forward state of the art that is made available by Meta. This model has surprisingly good overall scores and a good “Reason” capability. There is a strong chance that product-oriented fine-tune of Llama3 8B Instruct would be able to push this model to the TOP-10.

Long-term trends

Now let's look at the bigger picture: Where is the industry heading with all this?

More cost-effective & more powerful models

Firstly, models are generally getting better and more affordable. This is the general trend that Sam Altman recently outlined in his interview.

Further long-term LLM trends

  • New functional capabilities of LLMs

    LLMs gain new functional capabilities that are not even covered in this benchmark: Function calls, multimodality, data grounding. The latest version of LLM Under the Hood expands on this theme.
     

  • Experiments with new LLM architectures

    Companies also get bold and try to experiment with new LLM architectures outside the classical transformers' architecture. Mixture of Experts was popularised by Mistral, although many believe that GPT also uses it. Recurrent Neural Networks also experience a comeback as a way to solve context size limitations. For example: RWKV Language Model, Recurrent Gemma from Google Deep Mind (Griffin architecture).

Powerful models with low computing power

What is interesting about these models - they show decent capabilities, while requiring substantially less compute. For instance, we got a report of 0.4B version of RWKV running on a low-end Android phone with a tolerable speed (CPU-only inference).

 

Where are we heading with all this?

DEMOCRATIZATION OF AI

Expect the models to continue to get better, cheaper and more powerful. Sam Altman calls this the "democratization of AI". This applies to both cloud models and locally available models.

If you are in the process of building an LLM-driven system, expect that by the time the system is delivered, the underlying LLM will be much more powerful. In fact, you can take this into account and build a long-term strategy around it.

ADAPTABLE SYSTEMS

You can do that, for example, by designing LLM-driven systems to be transparent, auditable and capable of continuously adapting to the changing context.


Transform your digital projects with the best AI language models!

Discover the transformative power of the best LLM and revolutionize your digital products with AI! Stay future-oriented, increase efficiency and secure a clear competitive advantage. We support you in taking your business value to the next level.

* required

We use the data you send us only for contacting you in connection with your request. You can find all further information in our privacy policy.


Christoph HasenzaglChristoph HasenzaglBlog
Blog

Common Mistakes in the Development of AI Assistants

How fortunate that people make mistakes: because we can learn from them and improve. We have closely observed how companies around the world have implemented AI assistants in recent months and have, unfortunately, often seen them fail. We would like to share with you how these failures occurred and what can be learned from them for future projects: So that AI assistants can be implemented more successfully in the future!

Christoph HasenzaglChristoph HasenzaglBlog
Blog

8 tips for developing AI assistants

AI assistants for businesses are hype, and many teams were already eagerly and enthusiastically working on their implementation. Unfortunately, however, we have seen that many teams we have observed in Europe and the US have failed at the task. Read about our 8 most valuable tips, so that you will succeed.

TIMETOACT
Referenz
Referenz

Standardized data management creates basis for reporting

TIMETOACT implements a higher-level data model in a data warehouse for TRUMPF Photonic Components and provides the necessary data integration connection with Talend. With this standardized data management, TRUMPF will receive reports based on reliable data in the future and can also transfer the model to other departments.

TIMETOACT
Technologie
Headerbild zu IBM Cloud Pak for Data Accelerator
Technologie

IBM Cloud Pak for Data Accelerator

For a quick start in certain use cases, specifically for certain business areas or industries, IBM offers so-called accelerators based on the "Cloud Pak for Data" solution, which serve as a template for project development and can thus significantly accelerate the implementation of these use cases. The platform itself provides all the necessary functions for all types of analytics projects, and the accelerators provide the respective content.

TIMETOACT
Martin LangeMartin LangeBlog
Checkliste als Symbol für die verschiedenen To Dos im Bereich Lizenzmanagement
Blog

License Management – Everything you need to know

License management is not only relevant in terms of compliance but can also minimize costs and risks. Read more in the article.

TIMETOACT
Technologie
Headerbild zu IBM Watson Knowledge Studio
Technologie

IBM Watson Knowledge Studio

In IBM Watson Knowledge Studio, you train an Artificial Intelligence (AI) on specialist terms of your company or specialist area ("domain knowledge"). In this way, you lay the foundation for automated text processing of extensive, subject-related documents.

TIMETOACT
Technologie
Headerbild zu IBM Watson Discovery
Technologie

IBM Watson Discovery

With Watson Discovery, company data is searched using modern AI to extract information. On the one hand, the AI uses already trained methods to understand texts; on the other hand, it is constantly developed through new training on the company data, its structure and content, thus constantly improving the search results.

TIMETOACT
Technologie
Headerbild zu Cloud Pak for Data – Test-Drive
Technologie

IBM Cloud Pak for Data – Test-Drive

By making our comprehensive demo and customer data platform available, we want to offer these customers a way to get a very quick and pragmatic impression of the technology with their data.

TIMETOACT
Referenz
Referenz

Managed service support for optimal license management

To ensure software compliance, TIMETOACT supports FUNKE Mediengruppe with a SAM Managed Service for Microsoft, Adobe, Oracle and IBM.

TIMETOACT
Technologie
Headerbild zu IBM Watson Assistant
Technologie

IBM Watson Assistant

Watson Assistant identifies intention in requests that can be received via multiple channels. Watson Assistant is trained based on real-live requests and can understand the context and intent of the query based on the acting AI. Extensive search queries are routed to Watson Discovery and seamlessly embedded into the search result.

TIMETOACT
Referenz
Referenz

Modernes Business Intelligence und Data Warehouse System

IBM Cloud Pak for Data System enables healthcare group AGAPLESION to effectively manage data and perform complex analyses.

TIMETOACT
Referenz
Referenz

Interactive online portal identifies suitable employees

TIMETOACT digitizes several test procedures for KI.TEST to determine professional intelligence and personality.

TIMETOACT
Referenz
Referenz

Flexibility in the data evaluation of a theme park

With the support of TIMETOACT, an theme park in Germany has been using TM1 for many years in different areas of the company to carry out reporting, analysis and planning processes easily and flexibly.

TIMETOACT
Service
Header Konnzeption individueller Business Intelligence Lösungen
Service

Conception of individual Analytics and Big Data solutions

We determine the best approach to develop an individual solution from the professional, role-specific requirements – suitable for the respective situation!

Service
Service

Application Integration & Process Automation

Digitizing and improving business processes and responding agilely to change – more and more companies are facing these kind of challenges. This makes it all the more important to take new business opportunities through integrated and optimized processes based on intelligent, digitally networked systems.

Service
Service

Cloud Transformation & Container Technologies

Public, private or hybrid? We can help you develop your cloud strategy so you can take full advantage of the technology.

TIMETOACT
Service
Headerbild zu Operationalisierung von Data Science (MLOps)
Service

Operationalization of Data Science (MLOps)

Data and Artificial Intelligence (AI) can support almost any business process based on facts. Many companies are in the phase of professional assessment of the algorithms and technical testing of the respective technologies.

TIMETOACT
Service
Headerbild zu Digitale Planung, Forecasting und Optimierung
Service

Demand Planning, Forecasting and Optimization

After the data has been prepared and visualized via dashboards and reports, the task is now to use the data obtained accordingly. Digital planning, forecasting and optimization describes all the capabilities of an IT-supported solution in the company to support users in digital analysis and planning.

TIMETOACT
Service
Navigationsbild zu Data Science
Service

Data Science, Artificial Intelligence and Machine Learning

For some time, Data Science has been considered the supreme discipline in the recognition of valuable information in large amounts of data. It promises to extract hidden, valuable information from data of any structure.

TIMETOACT
Service
Teaserbild zu Data Integration Service und Consulting
Service

Data Integration, ETL and Data Virtualization

While the term "ETL" (Extract - Transform - Load / or ELT) usually described the classic batch-driven process, today the term "Data Integration" extends to all methods of integration: whether batch, real-time, inside or outside a database, or between any systems.

TIMETOACT
Technologie
Headerbild zu IBM Planning Analytics mit Watson
Technologie

IBM Planning Analytics mit Watson

IBM Planning Analytics with Watsons enables the automation of planning, budgeting, forecasting and analysis processes using IBM TM1.

TIMETOACT
Technologie
Headerbild zu IBM Decision Optimization
Technologie

Decision Optimization

Mathematical algorithms enable fast and efficient improvement of partially contradictory specifications. As an integral part of the IBM Data Science platform "Cloud Pak for Data" or "IBM Watson Studio", decision optimisation has been decisively expanded and embedded in the Data Science process.

Service
Service

Software, Mobile and Web App Development

Standard software often cannot completely fulfill a company's own requirements - TIMETOACT therefore develops customized software solutions.

Service
Service

Managed Service: Mailroom

In the TIMETOACT mailroom, business documents are converted into data in a highly efficient manner and returned securely to the end customer for further processing.

TIMETOACT
Technologie
Headerbild zu IBM Watson Studio
Technologie

IBM Watson Studio

IBM Watson Studio is an integrated solution for implementing a data science landscape. It helps companies to structure and simplify the process from exploratory analysis to the implementation and operationalisation of the analysis processes.

TIMETOACT
Technologie
Technologie

Microsoft Azure Synapse Analytics

With Synapse, Microsoft has provided a platform for all aspects of analytics in the Azure Cloud. Within the platform, Synapse includes services for data integration, data storage of any size and big data analytics. Together with existing architecture templates, a solution for every analytical use case is created in a short time.

TIMETOACT
Technologie
Headerbild für IBM SPSS
Technologie

IBM SPSS Modeler

IBM SPSS Modeler is a tool that can be used to model and execute tasks, for example in the field of Data Science and Data Mining, via a graphical user interface.

TIMETOACT
Technologie
Headerbild zu Microsoft Azure
Technologie

Microsoft Azure

Azure is the cloud offering from Microsoft. Numerous services are provided in Azure, not only for analytical requirements. Particularly worth mentioning from an analytical perspective are services for data storage (relational, NoSQL and in-memory / with Microsoft or OpenSource technology), Azure Data Factory for data integration, numerous services including AI and, of course, services for BI, such as Power BI or Analysis Services.

TIMETOACT
Technologie
Headerbild zu Talend Real-Time Big Data Platform
Technologie

Talend Real-Time Big Data Platform

Talend Big Data Platform simplifies complex integrations so you can successfully use Big Data with Apache Spark, Databricks, AWS, IBM Watson, Microsoft Azure, Snowflake, Google Cloud Platform and NoSQL.

TIMETOACT
Technologie
Headerbild Talend Application Integration
Technologie

Talend Application Integration / ESB

With Talend Application Integration, you create a service-oriented architecture and connect, broker & manage your services and APIs in real time.

TIMETOACT
Technologie
Headerbild zu Microsoft Power BI
Technologie

Microsoft Power BI

Power BI is the ideal complement to the Microsoft-centric analytic solution in the enterprise. As a standalone version "Power BI Desktop" it is free of charge. With Power BI, companies create quick, comprehensive and meaningful visual analyses.

TIMETOACT
Technologie
Headerbild IBM Cloud Pak for Data
Technologie

IBM Cloud Pak for Data

The Cloud Pak for Data acts as a central, modular platform for analytical use cases. It integrates functions for the physical and virtual integration of data into a central data pool - a data lake or a data warehouse, a comprehensive data catalogue and numerous possibilities for (AI) analysis up to the operational use of the same.

TIMETOACT
Technologie
Headerbild zu Microsoft SQL Server
Technologie

Microsoft SQL Server

SQL Server 2019 offers companies recognized good and extensive functions for building an analytical solution. Both data integration, storage, analysis and reporting can be realized, and through the tight integration of PowerBI, extensive visualizations can be created and data can be given to consumers.

TIMETOACT
Service
Headerbild zu Dashboards und Reports
Service

Dashboards & Reports

The discipline of Business Intelligence provides the necessary means for accessing data. In addition, various methods have developed that help to transport information to the end user through various technologies.

TIMETOACT
Technologie
Headerbild zu IBM Cloud Pak for Automation
Technologie

IBM Cloud Pak for Automation

The IBM Cloud Pak for Automation helps you automate manual steps on a uniform platform with standardised interfaces. With the Cloud Pak for Business Automation, the entire life cycle of a document or process can be mapped in the company.

TIMETOACT
Service
Navigationsbild zu Business Intelligence
Service

Business Intelligence

Business Intelligence (BI) is a technology-driven process for analyzing data and presenting usable information. On this basis, sound decisions can be made.

TIMETOACT
Technologie
Headerbild zu IBM Netezza Performance Server
Technologie

IBM Netezza Performance Server

IBM offers Database technology for specific purposes in the form of appliance solutions. In the Data Warehouse environment, the Netezza technology, later marketed under the name "IBM PureData for Analytics", is particularly well known.

TIMETOACT
Technologie
Haderbild zu IBM Cloud Pak for Application
Technologie

IBM Cloud Pak for Application

The IBM Cloud Pak for Application provides a solid foundation for developing, deploying and modernising cloud-native applications. Since agile working is essential for a faster release cycle, ready-made DevOps processes are used, among other things.

TIMETOACT
Technologie
Headerbild zu IBM DataStage
Technologie

IBM InfoSphere Information Server

IBM Information Server is a central platform for enterprise-wide information integration. With IBM Information Server, business information can be extracted, consolidated and merged from a wide variety of sources.

TIMETOACT
Service
Headerbild zu Data Governance Consulting
Service

Data Governance

Data Governance describes all processes that aim to ensure the traceability, quality and protection of data. The need for documentation and traceability increases exponentially as more and more data from different sources is used for decision-making and as a result of the technical possibilities of integration in Data Warehouses or Data Lakes.

TIMETOACT
Technologie
Headerbild zu Talend Data Fabric
Technologie

Talend Data Fabric

The ultimate solution for your data needs – Talend Data Fabric includes everything your (Data Integration) heart desires and serves all integration needs relating to applications, systems and data.

TIMETOACT
Technologie
Headerbild Talend Data Integration
Technologie

Talend Data Integration

Talend Data Integration offers a highly scalable architecture for almost any application and any data source - with well over 900 connectors from cloud solutions like Salesforce to classic on-premises systems.

TIMETOACT
Technologie
Headerbild IBM Cloud Pak for Data System
Technologie

IBM Cloud Pak for Data System

With the Cloud Pak for Data System (CP4DS), IBM provides the optimal hardware for the use of all Cloud Pak for Data functions industry-wide and thus continues the series of ready-configured systems ("Appliance" or "Hyperconverged System").

TIMETOACT
Technologie
Headerbild zu IBM Cognos Analytics 11
Technologie

IBM Cognos Analytics 11

IBM Cognos Analytics is a central platform for the provision of dispositive information in the company. With the reporting and analysis functions of IBM Cognos, the relevant information can be prepared and used throughout the company.

TIMETOACT
Technologie
Headerbild zu IBM DB2
Technologie

IBM Db2

The IBM Db2database has been established on the market for many years as the leading data warehouse database in addition to its classic use in operations.

TIMETOACT
Technologie
Headerbild zu IBM Watson® Knowledge Catalog
Technologie

IBM Watson® Knowledge Catalog/Information Governance Catalog

Today, "IGC" is a proprietary enterprise cataloging and metadata management solution that is the foundation of all an organization's efforts to comply with rules and regulations or document analytical assets.

TIMETOACT
Referenz
Referenz

Smarter mobility with the portal switchh

Subway, S-Bahn, bus, car, ferry or bicycle: The pilot project "switchh" of HOCHBAHN in cooperation with Europcar and Car2Go makes Hamburg mobile.

TIMETOACT
Referenz
Referenz

Custom licensing

MARKANT Handels und Service GmbH (MARKANT) is fully exploiting the potential of its IBM software licenses with this year's license renewal. Instead of relying on IBM's traditional Passport Advantage model as in the past, MARKANT is using a licensing concept specially adapted to the company for the first time.