The best Large Language Models of September 2024

The TIMETOACT GROUP LLM Benchmarks highlight the most powerful AI language models for digital product development. Discover which large language models performed best in September 2024.

LLM Benchmarks | September 2024

The benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license
⚠️ - Open source models that can be run locally with some license restrictions

Model Code CRM Docs Integrate Marketing Reason Final Cost Speed
GPT o1-preview v1/2024-09-12 ☁️ 95 92 94 96 88 87 92 52.32 € 0.08 rps
GPT o1-mini v1/2024-09-12 ☁️ 93 96 94 85 82 87 90 8.15 € 0.16 rps
Google Gemini 1.5 Pro v2 ☁️ 86 97 94 100 78 74 88 1.00 € 1.18 rps
GPT-4o v1/2024-05-13 ☁️ 90 96 100 89 78 74 88 1.21 € 1.44 rps
GPT-4o v3/dyn-2024-08-13 ☁️ 90 97 100 81 79 78 88 1.22 € 1.21 rps
GPT-4 Turbo v5/2024-04-09 ☁️ 86 99 98 100 88 43 86 2.45 € 0.84 rps
GPT-4o v2/2024-08-06 ☁️ 90 84 97 92 82 59 84 0.63 € 1.49 rps
Google Gemini 1.5 Pro 0801 ☁️ 84 92 79 100 70 74 83 0.90 € 0.83 rps
Qwen 2.5 72B Instruct ⚠️ 79 92 94 100 71 59 83 0.10 € 0.66 rps
Llama 3.1 405B Hermes 3🦙 68 93 89 100 88 53 82 0.54 € 0.49 rps
GPT-4 v1/0314 ☁️ 90 88 98 70 88 45 80 7.04 € 1.31 rps
GPT-4 v2/0613 ☁️ 90 83 95 70 88 45 78 7.04 € 2.16 rps
Claude 3 Opus ☁️ 69 88 100 78 76 58 78 4.69 € 0.41 rps
Claude 3.5 Sonnet ☁️ 72 83 89 85 80 58 78 0.94 € 0.09 rps
GPT-4 Turbo v4/0125-preview ☁️ 66 97 100 85 75 43 78 2.45 € 0.84 rps
GPT-4o Mini ☁️ 63 87 80 70 100 65 78 0.04 € 1.46 rps
Meta Llama3.1 405B Instruct🦙 81 93 92 70 75 48 76 2.39 € 1.16 rps
GPT-4 Turbo v3/1106-preview ☁️ 66 75 98 70 88 60 76 2.46 € 0.68 rps
DeepSeek v2.5 236B ⚠️ 57 80 91 78 88 57 75 0.03 € 0.42 rps
Google Gemini 1.5 Flash v2 ☁️ 64 96 89 75 81 44 75 0.06 € 2.01 rps
Google Gemini 1.5 Pro 0409 ☁️ 68 97 96 85 75 26 74 0.95 € 0.59 rps
Meta Llama 3.1 70B Instruct f16🦙 74 89 90 70 75 48 74 1.79 € 0.90 rps
GPT-3.5 v2/0613 ☁️ 68 81 73 81 81 50 72 0.34 € 1.46 rps
Meta Llama 3 70B Instruct🦙 81 83 84 60 81 45 72 0.06 € 0.85 rps
Mistral Large 123B v2/2407 ☁️ 68 79 68 75 75 70 72 0.86 € 1.02 rps
Google Gemini 1.5 Pro 0514 ☁️ 73 96 79 100 25 60 72 1.07 € 0.92 rps
Google Gemini 1.5 Flash 0514 ☁️ 32 97 100 75 72 52 71 0.06 € 1.77 rps
Google Gemini 1.0 Pro ☁️ 66 86 83 78 88 28 71 0.37 € 1.36 rps
Meta Llama 3.2 90B Vision🦙 74 84 87 78 71 32 71 0.23 € 1.10 rps
GPT-3.5 v3/1106 ☁️ 68 70 71 78 78 58 70 0.24 € 2.33 rps
GPT-3.5 v4/0125 ☁️ 63 87 71 78 78 43 70 0.12 € 1.43 rps
Qwen1.5 32B Chat f16 ⚠️ 70 90 82 78 78 20 69 0.97 € 1.66 rps
Cohere Command R+ ☁️ 63 80 76 70 70 58 69 0.83 € 1.90 rps
Gemma 2 27B IT ⚠️ 61 72 87 70 89 32 69 0.07 € 0.90 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ 68 87 67 70 88 25 67 0.32 € 3.39 rps
Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ 63 67 84 60 81 46 67 0.21 € 5.09 rps
Meta Llama 3 8B Instruct f16🦙 79 62 68 70 80 41 67 0.32 € 3.33 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ 63 73 72 69 88 30 66 0.32 € 3.40 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅ 58 72 72 70 88 33 65 0.49 € 2.20 rps
GPT-3.5-instruct 0914 ☁️ 47 92 69 62 88 33 65 0.35 € 2.15 rps
GPT-3.5 v1/0301 ☁️ 55 82 69 78 82 26 65 0.35 € 4.12 rps
Llama 3 8B OpenChat-3.6 20240522 f16 ✅ 76 51 76 60 88 38 65 0.28 € 3.79 rps
Mistral Nemo 12B v1/2407 ☁️ 54 58 51 100 75 49 64 0.03 € 1.22 rps
Meta Llama 3.2 11B Vision🦙 70 71 65 70 71 36 64 0.04 € 1.49 rps
Starling 7B-alpha f16 ⚠️ 58 66 67 70 88 34 64 0.58 € 1.85 rps
Llama 3 8B Hermes 2 Theta🦙 61 73 74 70 85 16 63 0.05 € 0.55 rps
Yi 1.5 34B Chat f16 ⚠️ 47 78 70 70 86 26 63 1.18 € 1.37 rps
Claude 3 Haiku ☁️ 64 69 64 70 75 35 63 0.08 € 0.52 rps
Meta Llama 3.1 8B Instruct f16🦙 57 74 62 70 74 32 61 0.45 € 2.41 rps
Qwen2 7B Instruct f32 ⚠️ 50 81 81 60 66 31 61 0.46 € 2.36 rps
Mistral Small v3/2409 ☁️ 43 75 71 75 75 26 61 0.06 € 0.81 rps
Claude 3 Sonnet ☁️ 72 41 74 70 78 28 61 0.95 € 0.85 rps
Mixtral 8x22B API (Instruct) ☁️ 53 62 62 100 75 7 60 0.17 € 3.12 rps
Mistral Pixtral 12B ✅ 53 69 73 60 64 40 60 0.03 € 0.83 rps
Codestral Mamba 7B v1 ✅ 53 66 51 100 71 17 60 0.30 € 2.82 rps
Anthropic Claude Instant v1.2 ☁️ 58 75 65 75 65 16 59 2.10 € 1.49 rps
Cohere Command R ☁️ 45 66 57 70 84 27 58 0.13 € 2.50 rps
Anthropic Claude v2.0 ☁️ 63 52 55 60 84 34 58 2.19 € 0.40 rps
Qwen1.5 7B Chat f16 ⚠️ 56 81 60 50 60 36 57 0.29 € 3.76 rps
Mistral Large v1/2402 ☁️ 37 49 70 78 84 25 57 0.58 € 2.11 rps
Microsoft WizardLM 2 8x22B ⚠️ 48 76 79 50 62 22 56 0.13 € 0.70 rps
Qwen1.5 14B Chat f16 ⚠️ 50 58 51 70 84 22 56 0.36 € 3.03 rps
Anthropic Claude v2.1 ☁️ 29 58 59 78 75 32 55 2.25 € 0.35 rps
Llama2 13B Vicuna-1.5 f16🦙 50 37 55 60 82 37 53 0.99 € 1.09 rps
Mistral 7B Instruct v0.1 f16 ☁️ 34 71 69 59 62 23 53 0.75 € 1.43 rps
Mistral 7B OpenOrca f16 ☁️ 54 57 76 25 78 27 53 0.41 € 2.65 rps
Meta Llama 3.2 3B🦙 52 71 66 70 44 14 53 0.01 € 1.25 rps
Google Recurrent Gemma 9B IT f16 ⚠️ 58 27 71 60 56 23 49 0.89 € 1.21 rps
Codestral 22B v1 ✅ 38 47 44 78 66 13 48 0.06 € 4.03 rps
Llama2 13B Hermes f16🦙 50 24 37 74 60 42 48 1.00 € 1.07 rps
IBM Granite 34B Code Instruct f16 ☁️ 63 49 34 70 57 7 47 1.07 € 1.51 rps
Mistral Small v2/2402 ☁️ 33 42 45 92 56 8 46 0.06 € 3.21 rps
DBRX 132B Instruct ⚠️ 43 39 43 77 59 10 45 0.26 € 1.31 rps
Mistral Medium v1/2312 ☁️ 41 43 44 61 62 12 44 0.81 € 0.35 rps
Meta Llama 3.2 1B🦙 32 40 33 40 68 51 44 0.02 € 1.69 rps
Llama2 13B Puffin f16🦙 37 15 44 70 56 39 43 4.70 € 0.23 rps
Mistral Small v1/2312 (Mixtral) ☁️ 10 67 63 52 56 8 43 0.06 € 2.21 rps
Microsoft WizardLM 2 7B ⚠️ 53 34 42 59 53 13 42 0.02 € 0.89 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ 22 47 59 38 62 8 39 0.05 € 2.39 rps
Gemma 2 9B IT ⚠️ 45 25 47 34 68 13 38 0.02 € 0.88 rps
Meta Llama2 13B chat f16🦙 22 38 17 60 75 6 36 0.75 € 1.44 rps
Mistral 7B Zephyr-β f16 ✅ 37 34 46 59 29 4 35 0.46 € 2.34 rps
Meta Llama2 7B chat f16🦙 22 33 20 60 50 18 34 0.56 € 1.93 rps
Mistral 7B Notus-v1 f16 ⚠️ 10 54 25 52 48 4 32 0.75 € 1.43 rps
Orca 2 13B f16 ⚠️ 18 22 32 22 67 20 30 0.95 € 1.14 rps
Mistral 7B v0.1 f16 ☁️ 0 9 48 53 52 12 29 0.87 € 1.23 rps
Mistral 7B Instruct v0.2 f16 ☁️ 11 30 54 12 58 8 29 0.96 € 1.12 rps
Google Gemma 2B IT f16 ⚠️ 33 28 16 57 15 20 28 0.30 € 3.54 rps
Microsoft Phi 3 Medium 4K Instruct 14B f16 ⚠️ 5 34 30 11 47 8 22 0.82 € 1.32 rps
Orca 2 7B f16 ⚠️ 22 0 26 20 52 4 21 0.78 € 1.38 rps
Google Gemma 7B IT f16 ⚠️ 0 0 0 9 62 0 12 0.99 € 1.08 rps
Meta Llama2 7B f16🦙 0 5 22 3 28 2 10 0.95 € 1.13 rps
Yi 1.5 9B Chat f16 ⚠️ 0 4 29 9 0 8 8 1.41 € 0.76 rps
Code

Can the model generate code and help with programming?

Cost

The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead.
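
To make this concrete, here is a minimal sketch of such an estimate. Everything in it is illustrative: the function, its parameters, and the numbers are assumptions, not our actual cost model.

```python
# Illustrative sketch of the on-premises cost estimate described above.
# All parameters and numbers are hypothetical placeholders.

def estimate_workload_cost(
    gpu_rental_eur_per_hour: float,  # rental price of a suitable GPU
    requests_per_second: float,      # measured model speed without batching
    num_requests: int,               # size of the benchmark workload
    overhead_factor: float = 1.2,    # operational overhead (provisioning, idle time)
) -> float:
    """Cost of running num_requests sequentially on a rented GPU."""
    hours_needed = num_requests / requests_per_second / 3600
    return hours_needed * gpu_rental_eur_per_hour * overhead_factor

# Example: 1,000 requests at 1.2 rps on a 2 EUR/h GPU
print(f"{estimate_workload_cost(2.0, 1.2, 1000):.2f} EUR")  # ~0.56 EUR
```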

CRM

How well does the model support work with product catalogs and marketplaces?

Docs

How well can the model work with large documents and knowledge bases?

Integrate

Can the model easily interact with external APIs, services and plugins?

Marketing

How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

Reason

How well can the model reason and draw conclusions in a given context?

Speed

The "Speed" column indicates the estimated speed of the model in requests per second (without batching). The higher the speed, the better.
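
As an illustration, a sequential measurement could look like the following sketch, where send_request stands in for a single call to the model under test; it is an assumption, not our actual harness.

```python
import time

def measure_rps(send_request, num_requests: int = 20) -> float:
    """Send requests one at a time (no batching) and return requests per second."""
    start = time.perf_counter()
    for _ in range(num_requests):
        send_request()  # one call to the model under test
    elapsed = time.perf_counter() - start
    return num_requests / elapsed
```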

ChatGPT o1 models are the best

OpenAI has released a radically new type of model, o1-preview, followed by o1-mini. These models differ from all the other LLMs out there: they run their own chain-of-thought routine for each request. This allows the model to decompose complex problems into smaller tasks and really think the answers through.

That approach shines, for example, in complex full-stack software engineering challenges. Compared to the “ordinary” GPT-4, o1 feels like an experienced mid-level software engineer that requires surprisingly little hand-holding.

There is one downside to this “chain of thought under the hood” process: o1 produces high-quality results, but they take more time and cost a lot more. Just look at the comparative pricing in the Cost column.

We look forward to seeing other LLM vendors take note of this trick and release their own LLMs with tuned chain-of-thought routines.
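
OpenAI doesn’t disclose how the o1 routine works internally. Purely as an illustration of the general idea, here is a hypothetical decompose-then-solve loop built on top of an ordinary chat-completion call; complete() is a placeholder, not a real API.

```python
# Hypothetical sketch of an explicit "decompose, then solve" routine,
# loosely mimicking what a chain-of-thought model might do internally.

def complete(prompt: str) -> str:
    """Placeholder for any ordinary chat-completion call."""
    raise NotImplementedError("plug in your LLM client here")

def solve_with_planning(problem: str) -> str:
    # Step 1: ask the model to break the problem into smaller tasks.
    plan = complete(f"Break this problem into numbered sub-tasks:\n{problem}")
    # Step 2: work through the sub-tasks, carrying notes forward.
    notes = ""
    for step in plan.splitlines():
        if step.strip():
            notes += complete(
                f"Problem: {problem}\nNotes so far:\n{notes}\nNow solve: {step}"
            ) + "\n"
    # Step 3: synthesise a final answer from the intermediate work.
    return complete(
        f"Problem: {problem}\nWorking notes:\n{notes}\nGive the final answer only."
    )
```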

Google Gemini 1.5 Pro v002 - TOP 3

While we are speaking of top results and cloud vendors, there is another new model in the TOP-3. Google has managed to keep up with the pace of progress and release a highly competitive model - Gemini 1.5 Pro v002.

This model systematically improves on the previous version in multiple categories: Code, CRM, Docs, and Marketing. It is also the cheapest model in the TOP-6 of our benchmark.

Practitioners already praise this model for its strong multilingual skills, while Google Cloud users are happy to have a top-tier LLM available in their cloud.

For a long time, it felt like OpenAI and Anthropic were the only companies that could really push the state of the art in top-tier LLMs, and that large mammoth companies were just too slow and old-school to release something worthy. Google has proven this wrong.

This is how the progress of Google models looks over time:

Now it doesn’t feel out of the ordinary to expect models of similar quality from Amazon or Microsoft. Perhaps this will spur a new round of competition, with further price drops and further quality improvements.

Enough about the cloud vendors. Let’s talk about local models now.

(Local models are models that you can download and run on your own hardware.)

Qwen 2.5 and DeepSeek 2.5

The recently released Qwen 2.5 Instruct is surprisingly good. It is the first local model to beat Claude 3.5 Sonnet on our business tasks. It also costs less than the other models at the top of the table.

Starting with this benchmark, we’ll use OpenRouter pricing as the base price for locally-capable LLMs. This allows us to estimate workload costs based on the real-world market. It also factors in any meaningful performance optimisations that LLM vendors apply to improve their margins.
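
For illustration, such a market-based estimate boils down to simple per-token arithmetic. The prices in this sketch are placeholders, not actual OpenRouter quotes:

```python
# Sketch: workload cost from per-token market prices (e.g. as listed on
# OpenRouter). The prices below are hypothetical placeholders.

EUR_PER_1M_INPUT = 0.35   # assumed input price per 1M tokens
EUR_PER_1M_OUTPUT = 0.40  # assumed output price per 1M tokens

def workload_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * EUR_PER_1M_INPUT
            + output_tokens * EUR_PER_1M_OUTPUT) / 1_000_000

# Example: a run with 2M input tokens and 0.5M output tokens
print(f"{workload_cost(2_000_000, 500_000):.2f} EUR")  # 0.90 EUR
```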

Qwen 2.5 72B diligently follows instructions (compared to Sonnet 3.5 or older GPT-4 versions) and has a decent Reason capability. This Chinese model still has gaps in Code and Marketing.

DeepSeek 2.5 didn’t perform nearly as well in our product benchmarks, despite having a huge size of 236B parameters. It runs roughly on the level of older versions of GPT-4 and Gemini 1.5 Pro.

This is actually outstanding news: more and more local models are reaching the level of GPT-4 Turbo intelligence. And the fact that the smaller Qwen 72B model has beaten it by a big margin is worth a separate celebration 🚀

We think this is not the last celebration of this kind this year.

Llama 3.2 - Mediocre results, but there is a small nuance

Meta has just released new versions of Llama - the 3.2 model range.

The larger models are now multi-modal. This came at the cost of cognitive capabilities in text-driven business tasks, compared to the previous model versions. Llama 3.2 is still far from the top.

If we look at the table:

  • Llama 3.2 90B Vision works on the level of Llama 3/3.1 70B, but with worse Reason.

  • Llama 3.2 11B Vision works on the level of the previous 8B, but with worse Reason.

This doesn’t make the new models worse - they have more capabilities now. Our benchmark currently tests only text-based business tasks. Vision tasks will be added in v2.

Having said that, there is a small nuance that makes this Llama 3.2 release truly outstanding. The size of that nuance is 1B and 3B. These are the sizes of the new tiny Llama 3.2 models, designed to run in resource-constrained environments and on the edge (optimised for ARM processors and for Qualcomm and MediaTek hardware). Despite the resource constraints, these models feature a 128k-token context and surprisingly high response quality in business tasks.

For example, do you remember the huge DBRX 132B Instruct model that claimed to be “a new state-of-the-art for established open LLMs”? Well, the Llama 3.2 1B model catches up with it in our benchmark, and the 3B beats it by a big margin. Just look at the neighbours of these models in the table:

Keep in mind that these benchmark results are for the base Llama versions. Customised fine-tunes tend to improve overall scores even further.

As you can see, progress doesn’t stand still. We’ll be waiting for the continuation of the trend where more and more companies manage to package better cognitive capability into smaller models.

To visualise this trend, we’ve plotted all releases of locally-capable models over a timeline and grouped them by the rough hardware requirements for running them. For each group, we’ve computed the current trend using linear regression (linregress).

Note: this grouping is very rough. We use the most frequent hardware combinations that we’ve seen among our customers and within our AI Research. We also assume that inference runs under fp16, without any further quantisation, and with enough spare VRAM to keep some context around.
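
For illustration, here is a rough sketch of that grouping and trend computation. The VRAM heuristic and the data points are made up for demonstration; only the linregress step mirrors what we actually compute.

```python
# Sketch of the grouping and trend computation described above.
# The model data points here are illustrative, not our benchmark dataset.
from scipy.stats import linregress

def fp16_vram_gb(params_billion: float, context_overhead_gb: float = 4.0) -> float:
    """Rough fp16 footprint: 2 bytes per parameter plus spare room for context."""
    return params_billion * 2 + context_overhead_gb

# (days since first release, final score, parameters in B) - hypothetical points
models = [(0, 55, 7), (120, 60, 8), (250, 67, 8), (300, 61, 8)]

# Group by rough hardware tier, e.g. "fits on a single 24 GB GPU".
small_gpu = [(day, score) for day, score, p in models if fp16_vram_gb(p) <= 24]

days, scores = zip(*small_gpu)
trend = linregress(days, scores)
print(f"~{trend.slope * 30:.1f} score points per month in the small-GPU tier")
```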

Here are a few observations.

  • All models are getting better over time - both small and big ones.
  • Interesting large models started showing up on the radar only this year.
  • Large models currently show the fastest rate of improvement.

These observations are obvious; you don’t need a chart to figure them out. However, visualisations make the rate of progress easier to grasp. It can then be communicated to customers and accounted for in long-term plans.

Transform Your Digital Projects with the Best AI Language Models!

Discover the transformative power of the best LLMs and revolutionize your digital products with AI! Stay future-focused, boost efficiency, and gain a clear competitive edge. We help you elevate your business value to the next level.

 
