The best language models for digital products in may 2024

The TIMETOACT GROUP LLM Benchmarks highlight the most powerful AI language models for digital product development. Discover which large language models performed best in may 2024.

Based on real benchmark data from our own software products, we evaluated the performance of different LLM models in addressing specific challenges. We examined specific categories such as document processing, CRM integration, external integration, marketing support, and code generation.

Highlights:

New Google Gemini 1.5 version - better than the previous ones
GPT-4o - best in this benchmark, but has caveats
Qwen1.5 Chat - one of the best downloadable models
IBM Granite Code Instruct - decent coding results with clean data provenance

LLM Benchmarks | May 2024

Our benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama2 license

A more detailed explanation of the respective categories can be found below the table.

Hide Cost

Model	Code	Crm	Docs	Integrate	Marketing	Reason	Final	cost	Speed
GPT-4o ☁️	85	95	100	90	82	75	88	1.24 €	1.49 rps
GPT-4 Turbo v5/2024-04-09 ☁️	80	99	98	93	88	45	84	2.51 €	0.83 rps
GPT-4 v1/0314 ☁️	80	88	98	52	88	50	76	7.19 €	1.26 rps
GPT-4 Turbo v4/0125-preview ☁️	60	97	100	71	75	45	75	2.51 €	0.82 rps
GPT-4 v2/0613 ☁️	80	83	95	52	88	50	74	7.19 €	2.07 rps
Claude 3 Opus ☁️	64	88	100	53	76	59	73	4.83 €	0.41 rps
GPT-4 Turbo v3/1106-preview ☁️	60	75	98	52	88	62	72	2.52 €	0.68 rps
Gemini Pro 1.5 0514 ☁️	67	96	75	100	25	62	71	2.06 €	0.91 rps
Gemini Pro 1.5 0409 ☁️	62	97	96	63	75	28	70	1.89 €	0.58 rps
GPT-3.5 v2/0613 ☁️	62	79	73	75	81	48	70	0.35 €	1.39 rps
GPT-3.5 v3/1106 ☁️	62	68	71	63	78	59	67	0.24 €	2.29 rps
Gemini 1.5 Flash 0514 ☁️	32	97	100	56	72	41	66	0.10 €	1.76 rps
GPT-3.5 v4/0125 ☁️	58	85	71	60	78	47	66	0.13 €	1.41 rps
Gemini Pro 1.0 ☁️	55	86	83	60	88	26	66	0.10 €	1.35 rps
Cohere Command R+ ☁️	58	77	76	49	70	59	65	0.85 €	1.88 rps
Qwen1.5 32B Chat f16 ⚠️	64	87	82	56	78	15	64	1.02 €	1.61 rps
GPT-3.5-instruct 0914 ☁️	44	90	69	60	88	32	64	0.36 €	2.12 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅	56	86	67	52	88	26	62	0.37 €	2.99 rps
Meta Llama 3 8B Instruct f16🦙	74	60	68	49	80	42	62	0.35 €	3.16 rps
GPT-3.5 v1/0301 ☁️	49	75	69	67	82	24	61	0.36 €	3.93 rps
Starling 7B-alpha f16 ⚠️	51	66	67	52	88	36	60	0.61 €	1.80 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅	46	72	72	49	88	31	60	0.51 €	2.14 rps
Claude 3 Haiku ☁️	59	69	64	55	75	33	59	0.08 €	0.53 rps
Mixtral 8x22B API (Instruct) ☁️	47	62	62	94	75	7	58	0.18 €	3.01 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅	51	74	72	41	75	31	57	0.36 €	3.05 rps
Claude 3 Sonnet ☁️	67	41	74	52	78	30	57	0.97 €	0.85 rps
Mistral Large v1/2402 ☁️	33	49	70	75	84	25	56	2.19 €	2.04 rps
Anthropic Claude Instant v1.2 ☁️	51	75	65	59	65	14	55	2.15 €	1.47 rps
Anthropic Claude v2.0 ☁️	57	52	55	45	84	35	55	2.24 €	0.40 rps
Cohere Command R ☁️	39	63	57	55	84	26	54	0.13 €	2.47 rps
Qwen1.5 7B Chat f16 ⚠️	51	81	60	34	60	36	54	0.30 €	3.62 rps
Anthropic Claude v2.1 ☁️	36	58	59	60	75	33	53	2.31 €	0.35 rps
Qwen1.5 14B Chat f16 ⚠️	44	58	51	49	84	17	51	0.38 €	2.90 rps
Meta Llama 3 70B Instruct b8🦙	46	72	53	29	82	18	50	7.32 €	0.22 rps
Mistral 7B OpenOrca f16 ☁️	42	57	76	21	78	26	50	0.43 €	2.55 rps
Mistral 7B Instruct v0.1 f16 ☁️	31	70	69	44	62	21	50	0.79 €	1.39 rps
Llama2 13B Vicuna-1.5 f16🦙	36	37	53	39	82	38	48	1.02 €	1.07 rps
Llama2 13B Hermes f16🦙	38	23	30	61	60	43	42	1.03 €	1.06 rps
Llama2 13B Hermes b8🦙	32	24	29	61	60	43	42	4.94 €	0.22 rps
Mistral Small v1/2312 (Mixtral) ☁️	10	58	65	51	56	8	41	0.19 €	2.17 rps
Mistral Small v2/2402 ☁️	27	35	36	82	56	8	41	0.19 €	3.14 rps
IBM Granite 34B Code Instruct f16 ☁️	52	49	30	44	57	5	40	1.12 €	1.46 rps
Llama2 13B Puffin f16🦙	37	12	38	48	56	41	39	4.89 €	0.22 rps
Mistral Medium v1/2312 ☁️	36	30	27	59	62	12	38	0.83 €	0.35 rps
Llama2 13B Puffin b8🦙	37	9	37	46	56	39	37	8.65 €	0.13 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️	13	39	57	40	59	8	36	0.05 €	2.30 rps
Llama2 13B chat f16🦙	15	38	17	45	75	8	33	0.76 €	1.43 rps
Llama2 13B chat b8🦙	15	38	15	45	75	6	32	3.35 €	0.33 rps
Mistral 7B Zephyr-β f16 ✅	28	34	46	44	29	4	31	0.51 €	2.14 rps
Llama2 7B chat f16🦙	20	33	20	42	50	20	31	0.59 €	1.86 rps
Mistral 7B Notus-v1 f16 ⚠️	16	43	25	41	48	4	30	0.80 €	1.37 rps
Orca 2 13B f16 ⚠️	15	22	32	22	67	19	29	0.99 €	1.11 rps
Microsoft Phi 3 Mini 4K Instruct f16 ⚠️	36	24	26	17	50	8	27	0.95 €	1.15 rps
Mistral 7B Instruct v0.2 f16 ☁️	7	21	50	13	58	8	26	1.00 €	1.10 rps
Mistral 7B f16 ☁️	0	4	42	42	52	12	25	0.93 €	1.17 rps
Orca 2 7B f16 ⚠️	13	0	24	18	52	4	19	0.81 €	1.34 rps
Llama2 7B f16🦙	0	2	18	3	28	2	9	1.01 €	1.08 rps

The benchmark categories in detail

Here's exactly what we're looking at with the different categories of LLM Leaderboards

Docs

How well can the model work with large documents and knowledge bases?

CRM

How well does the model support work with product catalogs and marketplaces?

Integrate

Can the model easily interact with external APIs, services and plugins?

Marketing

How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

Reason

How well can the model reason and draw conclusions in a given context?

Code

Can the model generate code and help with programming?

Cost

The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead.

Speed

The "Speed" column indicates the estimated speed of the model in requests per second (without batching). The higher the speed, the better.

Deeper insights

Google Gemini 1.5 - Pro and Flash

The recent Google IO announcement was all about AI and Gemini Pro. Although Google managed to forget about Gemini Ultra (it was supposed to come out early this year), they did update a few models.

GEMINI PRO 1.5 0514: UPDATED CAPABILITIES

Gemini Pro 1.5 0514 trades off some document comprehension capabilities in favour of better reasoning. It comes out as a slightly better version than the previous Gemini Pro version.

In our experience it was also a bit buggy at this point. We’ve encountered a bunch of server errors and even managed to get “HARM_CATEGORY_DANGEROUS_CONTENT” flag on one of the business benchmarks.

EXCELLENT "INTEGRATE" SKILLS

Gemini Pro 1.5 scored a perfect "Integrate" score, where we measure LLM's ability to reliably follow instructions and work with external systems, plugins and data formats.

GEMINI 1.5 FLASH AS AN INTERESTING ALTERNATIVE TO GPT-3.5

Gemini 1.5 Flash is an interesting new addition to the family. It works well in document tasks and has decent reasoning capability. Combine that with a very low price, and you get a good alternative to GPT-3.5.

Just note Google Gemini's peculiar pricing model - they don't charge text by tokens, but by billable symbols(Unicode code points minus spaces). We have seen several developers and even SaaS systems make mistakes when estimating costs.

GPT-4o - Clearly in the lead, but with a caveat

GPT-4o looks perfect at first glance. It is faster and cheaper than GPT-4 turbo. It also has 128K context, scores higher, has native multi-modality and understands languages better.

Under the hood it has a new tokeniser with a bigger dictionary. It leads to reduced token counts.

Overall the model scores didn’t jump much higher, because we are already operating at the limit of our LLM Benchmark. There is just one gotcha. Our “Reason” category (ability of models to handle complex logical and reasoning tasks) was made inherently difficult. GPT-4o managed to increase score from 62 (GPT-4 Turbo v3/1106-preview) to 75.

What is the caveat?

You see, OpenAI seems to operate in cycles. They switch between: “let’s make a better model” and “let’s make a cheaper model without sacrificing quality too much”.

While the LLM Benchmarks don’t catch it, it feels like the GPT-4o model belongs to the cost reduction models. It works amazingly well on small prompts, however other benchmarks demonstrate that it is not as good in dealing with larger contexts, like the other GPT-4 models. It also feels lacking in reasoning, even though current benchmarks are not capable of catching this regression.

Hint

For the time being, GPT-4 Turbo v5/2024-04-09 is our recommended go-to model.

Qwen 1.5 Chat

HIGHLY DEMANDED BENCHMARK

Due to the high demand and good ratings in the LMSYS Arena, we decided to benchmark some flavours of Alibaba Cloud's Qwen Chat model.

EVALUATION OF QWEN 1.5 32B CHAT

Qwen 1.5 32B Chat is quite good. It is within the range of GPT-3.5 models and Gemini Pro 1.0. It comes with a non-standard license, though.

We have also tested Qwen 1.5 7B and 14B - they are quite good for their relative size. Nothing peculiar, just a decent performance.

The license of Qwen1.5 Chat is a Chinese equivalent of Llama 3: you can use it freely for commercial purposes, if you have less than 100M MAU. This might make model adoption tricky in the USA and EU.

IBM Granite 34B Code Instruct

That is the only peculiar thing about their models, though. While the previous versions of IBM Granite models were available only within the IBM Cloud, the model under test was published directly to Hugging Face.

However, that is the only special thing about their models. While the previous versions of the IBM Granite models were only available within the IBM Cloud, the tested model was published directly on Hugging Face.

Long story short, IBM Granite 34B Code Instruct has a decent code capability (for a 7B model) and bad results in pretty much everything else. If you need a local coding model at a fraction of compute cost, just pick Llama3 or one of its derivatives.

LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!

Learn more

Transform your digital projects with the best AI language models!

Discover the transformative power of the best LLM and revolutionize your digital products with AI! Stay future-oriented, increase efficiency and secure a clear competitive advantage. We support you in taking your business value to the next level.

First Name *

Last Name *

Company *

E-Mail *

Phone number

Message *

* required

We use the data you send us only for contacting you in connection with your request. You can find all further information in our privacy policy.

Martin Warnung

Sales Consultant TIMETOACT GROUP Österreich GmbH +43 664 881 788 80

Contact

Wissen 7/30/24

LLM-Benchmarks July 2024

This LLM Leaderboard from July 2024 helps to find the best Large Language Model for digital product development.

Wissen 6/30/24

LLM-Benchmarks June 2024

This LLM Leaderboard from june 2024 helps to find the best Large Language Model for digital product development.

Wissen 4/30/24

LLM-Benchmarks April 2024

This LLM Leaderboard from april 2024 helps to find the best Large Language Model for digital product development.

Blog 11/12/24

ChatGPT & Co: LLM Benchmarks for October

Find out which large language models outperformed in the October 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 12/4/24

ChatGPT & Co: LLM Benchmarks for November

Find out which large language models outperformed in the November 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 1/7/25

ChatGPT & Co: LLM Benchmarks for December

Find out which large language models outperformed in the December 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 10/1/24

ChatGPT & Co: LLM Benchmarks for September

Find out which large language models outperformed in the September 2024 benchmarks. Stay informed on the latest AI developments and performance metrics.

Blog 2/3/25

ChatGPT & Co: LLM Benchmarks for January

Find out which large language models outperformed in the January 2025 benchmarks. Stay informed on the latest AI developments and performance metrics.

Wissen 8/30/24

LLM-Benchmarks August 2024

Instead of our general LLM benchmarks, we present the first benchmark of different AI architectures in August.

Insights 3/17/25

LLM Benchmarks: February 2025

Discover the latest insights from our independent LLM benchmarks for February 2025. Find out which large language models performed best.

Blog 9/20/23

LLM Performance Series: Batching

Beginning with the September Trustbit LLM Benchmarks, we are now giving particular focus to a range of enterprise workloads. These encompass the kinds of tasks associated with Large Language Models that are frequently encountered in the context of large-scale business digitalization.

Wissen 4/30/24

GPT & Co: The best language models for digital products

Our analysis based on real benchmark data reveals which solutions excel in document processing, CRM integration, external integration, marketing support and code generation. Find your ideal model!

Blog 6/22/23

Strategic Impact of Large Language Models

This blog discusses the rapid advancements in large language models, particularly highlighting the impact of OpenAI's GPT models.

Insights

LLM Benchmarks March 2025

What's new in the world of LLMs? Find out and read why Google DeepMind managed to surprise us more than once last month.

Blog 11/5/24

AIM Hackathon 2024: Sustainability Meets LLMs

Focusing on impactful AI applications, participants addressed key issues like greenwashing detection, ESG report relevance mapping, and compliance with the European Green Deal.

Blog 5/16/24

Common Mistakes in the Development of AI Assistants

We share how failures when implementing AI occurr and what can be learned from them for future projects: So that AI assistants can be implemented more successfully in the future!

Blog 5/17/24

8 tips for developing AI assistants

8 practical tips for implementing AI assistants

Blog 1/21/25

AI Contest - Enterprise RAG Challenge

TIMETOACT GROUP Austria demonstrates how RAG technologies can revolutionize processes with the Enterprise RAG Challenge.

Wissen 5/2/24

Unlock the Potential of Data Culture in Your Organization

Are you ready to revolutionize your organization's potential by unleashing the power of data culture? Imagine a workplace where every decision is backed by insights, every strategy informed by data, and every employee equipped to navigate the digital landscape with confidence. This is the transformative impact of cultivating a robust data culture within your enterprise.

Blog 3/11/25

Answering Business Questions with LLMs

8th place in Enterprise RAG Challenge 2025: Answering Business Questions with LLMs