We have a few new models to talk about, so let’s get started:
Grok 2 from X-AI - Suddenly in the TOP 15
Gemini 1.5 Flash 8B - Future Perfect
New Claude Sonnet 3.5 and Haiku 3.5 - Getting better
LLM Benchmarks | October 2024
The benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.
☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license
⚠️ - Models with custom or restricted licenses
**Code**: Can the model generate code and help with programming?
**CRM**: How well does the model support work with product catalogs and marketplaces?
**Docs**: How well can the model work with large documents and knowledge bases?
**Integrate**: Can the model easily interact with external APIs, services and plugins?
**Marketing**: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?
**Reason**: How well can the model reason and draw conclusions in a given context?
**Final**: The average of the category scores above.
**Cost**: The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead (see the sketch below these descriptions).
**Speed**: The estimated speed of the model in requests per second (without batching). The higher the speed, the better.
Model | Code | CRM | Docs | Integrate | Marketing | Reason | Final | Cost | Speed |
---|---|---|---|---|---|---|---|---|---|
1. GPT o1-preview v1/2024-09-12 ☁️ | 95 | 92 | 94 | 96 | 88 | 87 | 92 | 52.32 € | 0.08 rps |
2. GPT o1-mini v1/2024-09-12 ☁️ | 93 | 96 | 94 | 85 | 82 | 87 | 90 | 8.15 € | 0.16 rps |
3. Google Gemini 1.5 Pro v2 ☁️ | 86 | 97 | 94 | 100 | 78 | 74 | 88 | 1.00 € | 1.18 rps |
4. GPT-4o v1/2024-05-13 ☁️ | 90 | 96 | 100 | 89 | 78 | 74 | 88 | 1.21 € | 1.44 rps |
5. GPT-4o v3/dyn-2024-08-13 ☁️ | 90 | 97 | 100 | 81 | 79 | 78 | 88 | 1.22 € | 1.21 rps |
6. GPT-4 Turbo v5/2024-04-09 ☁️ | 86 | 99 | 98 | 100 | 88 | 43 | 86 | 2.45 € | 0.84 rps |
7. GPT-4o v2/2024-08-06 ☁️ | 90 | 84 | 97 | 92 | 82 | 59 | 84 | 0.63 € | 1.49 rps |
8. Google Gemini 1.5 Pro 0801 ☁️ | 84 | 92 | 79 | 100 | 70 | 74 | 83 | 0.90 € | 0.83 rps |
9. Qwen 2.5 72B Instruct ⚠️ | 79 | 92 | 94 | 100 | 71 | 59 | 83 | 0.10 € | 0.66 rps |
10. Llama 3.1 405B Hermes 3🦙 | 68 | 93 | 89 | 100 | 88 | 53 | 82 | 0.54 € | 0.49 rps |
11. Claude 3.5 Sonnet v2 ☁️ | 82 | 97 | 93 | 85 | 71 | 57 | 81 | 0.95 € | 0.09 rps |
12. GPT-4 v1/0314 ☁️ | 90 | 88 | 98 | 70 | 88 | 45 | 80 | 7.04 € | 1.31 rps |
13. X-AI Grok 2 ⚠️ | 63 | 93 | 87 | 89 | 88 | 58 | 79 | 1.03 € | 0.31 rps |
14. GPT-4 v2/0613 ☁️ | 90 | 83 | 95 | 70 | 88 | 45 | 78 | 7.04 € | 2.16 rps |
15. Claude 3 Opus ☁️ | 69 | 88 | 100 | 78 | 76 | 58 | 78 | 4.69 € | 0.41 rps |
16. Claude 3.5 Sonnet v1 ☁️ | 72 | 83 | 89 | 85 | 80 | 58 | 78 | 0.94 € | 0.09 rps |
17. GPT-4 Turbo v4/0125-preview ☁️ | 66 | 97 | 100 | 85 | 75 | 43 | 78 | 2.45 € | 0.84 rps |
18. GPT-4o Mini ☁️ | 63 | 87 | 80 | 70 | 100 | 65 | 78 | 0.04 € | 1.46 rps |
19. Meta Llama3.1 405B Instruct🦙 | 81 | 93 | 92 | 70 | 75 | 48 | 76 | 2.39 € | 1.16 rps |
20. GPT-4 Turbo v3/1106-preview ☁️ | 66 | 75 | 98 | 70 | 88 | 60 | 76 | 2.46 € | 0.68 rps |
21. DeepSeek v2.5 236B ⚠️ | 57 | 80 | 91 | 78 | 88 | 57 | 75 | 0.03 € | 0.42 rps |
22. Google Gemini 1.5 Flash v2 ☁️ | 64 | 96 | 89 | 75 | 81 | 44 | 75 | 0.06 € | 2.01 rps |
23. Google Gemini 1.5 Pro 0409 ☁️ | 68 | 97 | 96 | 85 | 75 | 26 | 74 | 0.95 € | 0.59 rps |
24. Meta Llama 3.1 70B Instruct f16🦙 | 74 | 89 | 90 | 70 | 75 | 48 | 74 | 1.79 € | 0.90 rps |
25. Google Gemini Flash 1.5 8B ☁️ | 70 | 93 | 78 | 69 | 76 | 48 | 72 | 0.01 € | 1.19 rps |
26. GPT-3.5 v2/0613 ☁️ | 68 | 81 | 73 | 81 | 81 | 50 | 72 | 0.34 € | 1.46 rps |
27. Meta Llama 3 70B Instruct🦙 | 81 | 83 | 84 | 60 | 81 | 45 | 72 | 0.06 € | 0.85 rps |
28. Mistral Large 123B v2/2407 ☁️ | 68 | 79 | 68 | 75 | 75 | 70 | 72 | 0.86 € | 1.02 rps |
29. Google Gemini 1.5 Pro 0514 ☁️ | 73 | 96 | 79 | 100 | 25 | 60 | 72 | 1.07 € | 0.92 rps |
30. Google Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 75 | 72 | 52 | 71 | 0.06 € | 1.77 rps |
31. Google Gemini 1.0 Pro ☁️ | 66 | 86 | 83 | 78 | 88 | 28 | 71 | 0.37 € | 1.36 rps |
32. Meta Llama 3.2 90B Vision🦙 | 74 | 84 | 87 | 78 | 71 | 32 | 71 | 0.23 € | 1.10 rps |
33. GPT-3.5 v3/1106 ☁️ | 68 | 70 | 71 | 78 | 78 | 58 | 70 | 0.24 € | 2.33 rps |
34. GPT-3.5 v4/0125 ☁️ | 63 | 87 | 71 | 78 | 78 | 43 | 70 | 0.12 € | 1.43 rps |
35. Claude 3.5 Haiku ☁️ | 52 | 80 | 72 | 70 | 75 | 68 | 70 | 0.32 € | 1.24 rps |
36. Qwen1.5 32B Chat f16 ⚠️ | 70 | 90 | 82 | 78 | 78 | 20 | 69 | 0.97 € | 1.66 rps |
37. Cohere Command R+ ☁️ | 63 | 80 | 76 | 70 | 70 | 58 | 69 | 0.83 € | 1.90 rps |
38. Gemma 2 27B IT ⚠️ | 61 | 72 | 87 | 70 | 89 | 32 | 69 | 0.07 € | 0.90 rps |
39. Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 68 | 87 | 67 | 70 | 88 | 25 | 67 | 0.32 € | 3.39 rps |
40. Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ | 63 | 67 | 84 | 60 | 81 | 46 | 67 | 0.21 € | 5.09 rps |
41. Meta Llama 3 8B Instruct f16🦙 | 79 | 62 | 68 | 70 | 80 | 41 | 67 | 0.32 € | 3.33 rps |
42. Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 63 | 73 | 72 | 69 | 88 | 30 | 66 | 0.32 € | 3.40 rps |
43. Mistral 7B OpenChat-3.5 v1 f16 ✅ | 58 | 72 | 72 | 70 | 88 | 33 | 65 | 0.49 € | 2.20 rps |
44. GPT-3.5-instruct 0914 ☁️ | 47 | 92 | 69 | 62 | 88 | 33 | 65 | 0.35 € | 2.15 rps |
45. GPT-3.5 v1/0301 ☁️ | 55 | 82 | 69 | 78 | 82 | 26 | 65 | 0.35 € | 4.12 rps |
46. Llama 3 8B OpenChat-3.6 20240522 f16 ✅ | 76 | 51 | 76 | 60 | 88 | 38 | 65 | 0.28 € | 3.79 rps |
47. Mistral Nemo 12B v1/2407 ☁️ | 54 | 58 | 51 | 100 | 75 | 49 | 64 | 0.03 € | 1.22 rps |
48. Meta Llama 3.2 11B Vision🦙 | 70 | 71 | 65 | 70 | 71 | 36 | 64 | 0.04 € | 1.49 rps |
49. Starling 7B-alpha f16 ⚠️ | 58 | 66 | 67 | 70 | 88 | 34 | 64 | 0.58 € | 1.85 rps |
50. Qwen 2.5 7B Instruct ⚠️ | 48 | 77 | 80 | 60 | 69 | 47 | 63 | 0.07 € | 1.25 rps |
51. Llama 3 8B Hermes 2 Theta🦙 | 61 | 73 | 74 | 70 | 85 | 16 | 63 | 0.05 € | 0.55 rps |
52. Yi 1.5 34B Chat f16 ⚠️ | 47 | 78 | 70 | 70 | 86 | 26 | 63 | 1.18 € | 1.37 rps |
53. Claude 3 Haiku ☁️ | 64 | 69 | 64 | 70 | 75 | 35 | 63 | 0.08 € | 0.52 rps |
54. Liquid: LFM 40B MoE ⚠️ | 72 | 69 | 65 | 60 | 82 | 24 | 62 | 0.00 € | 1.45 rps |
55. Meta Llama 3.1 8B Instruct f16🦙 | 57 | 74 | 62 | 70 | 74 | 32 | 61 | 0.45 € | 2.41 rps |
56. Qwen2 7B Instruct f32 ⚠️ | 50 | 81 | 81 | 60 | 66 | 31 | 61 | 0.46 € | 2.36 rps |
57. Mistral Small v3/2409 ☁️ | 43 | 75 | 71 | 75 | 75 | 26 | 61 | 0.06 € | 0.81 rps |
58. Claude 3 Sonnet ☁️ | 72 | 41 | 74 | 70 | 78 | 28 | 61 | 0.95 € | 0.85 rps |
59. Mixtral 8x22B API (Instruct) ☁️ | 53 | 62 | 62 | 100 | 75 | 7 | 60 | 0.17 € | 3.12 rps |
60. Mistral Pixtral 12B ✅ | 53 | 69 | 73 | 60 | 64 | 40 | 60 | 0.03 € | 0.83 rps |
61. Codestral Mamba 7B v1 ✅ | 53 | 66 | 51 | 100 | 71 | 17 | 60 | 0.30 € | 2.82 rps |
62. Inflection 3 Productivity ⚠️ | 46 | 59 | 39 | 70 | 79 | 61 | 59 | 0.92 € | 0.17 rps |
63. Anthropic Claude Instant v1.2 ☁️ | 58 | 75 | 65 | 75 | 65 | 16 | 59 | 2.10 € | 1.49 rps |
64. Cohere Command R ☁️ | 45 | 66 | 57 | 70 | 84 | 27 | 58 | 0.13 € | 2.50 rps |
65. Anthropic Claude v2.0 ☁️ | 63 | 52 | 55 | 60 | 84 | 34 | 58 | 2.19 € | 0.40 rps |
66. Qwen1.5 7B Chat f16 ⚠️ | 56 | 81 | 60 | 50 | 60 | 36 | 57 | 0.29 € | 3.76 rps |
67. Mistral Large v1/2402 ☁️ | 37 | 49 | 70 | 78 | 84 | 25 | 57 | 0.58 € | 2.11 rps |
68. Microsoft WizardLM 2 8x22B ⚠️ | 48 | 76 | 79 | 50 | 62 | 22 | 56 | 0.13 € | 0.70 rps |
69. Qwen1.5 14B Chat f16 ⚠️ | 50 | 58 | 51 | 70 | 84 | 22 | 56 | 0.36 € | 3.03 rps |
70. MistralAI Ministral 8B ✅ | 56 | 55 | 41 | 85 | 68 | 30 | 56 | 0.02 € | 1.02 rps |
71. MistralAI Ministral 3B ✅ | 50 | 48 | 39 | 92 | 60 | 41 | 55 | 0.01 € | 1.02 rps |
72. Anthropic Claude v2.1 ☁️ | 29 | 58 | 59 | 78 | 75 | 32 | 55 | 2.25 € | 0.35 rps |
73. Llama2 13B Vicuna-1.5 f16🦙 | 50 | 37 | 55 | 60 | 82 | 37 | 53 | 0.99 € | 1.09 rps |
74. Mistral 7B Instruct v0.1 f16 ☁️ | 34 | 71 | 69 | 59 | 62 | 23 | 53 | 0.75 € | 1.43 rps |
75. Mistral 7B OpenOrca f16 ☁️ | 54 | 57 | 76 | 25 | 78 | 27 | 53 | 0.41 € | 2.65 rps |
76. Meta Llama 3.2 3B🦙 | 52 | 71 | 66 | 70 | 44 | 14 | 53 | 0.01 € | 1.25 rps |
77. Google Recurrent Gemma 9B IT f16 ⚠️ | 58 | 27 | 71 | 60 | 56 | 23 | 49 | 0.89 € | 1.21 rps |
78. Codestral 22B v1 ✅ | 38 | 47 | 44 | 78 | 66 | 13 | 48 | 0.06 € | 4.03 rps |
79. Llama2 13B Hermes f16🦙 | 50 | 24 | 37 | 74 | 60 | 42 | 48 | 1.00 € | 1.07 rps |
80. IBM Granite 34B Code Instruct f16 ☁️ | 63 | 49 | 34 | 70 | 57 | 7 | 47 | 1.07 € | 1.51 rps |
81. Mistral Small v2/2402 ☁️ | 33 | 42 | 45 | 92 | 56 | 8 | 46 | 0.06 € | 3.21 rps |
82. DBRX 132B Instruct ⚠️ | 43 | 39 | 43 | 77 | 59 | 10 | 45 | 0.26 € | 1.31 rps |
83. NVIDIA Llama 3.1 Nemotron 70B Instruct🦙 | 68 | 54 | 25 | 74 | 28 | 21 | 45 | 0.09 € | 0.53 rps |
84. Mistral Medium v1/2312 ☁️ | 41 | 43 | 44 | 61 | 62 | 12 | 44 | 0.81 € | 0.35 rps |
85. Meta Llama 3.2 1B🦙 | 32 | 40 | 33 | 40 | 68 | 51 | 44 | 0.02 € | 1.69 rps |
86. Llama2 13B Puffin f16🦙 | 37 | 15 | 44 | 70 | 56 | 39 | 43 | 4.70 € | 0.23 rps |
87. Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 67 | 63 | 52 | 56 | 8 | 43 | 0.06 € | 2.21 rps |
88. Microsoft WizardLM 2 7B ⚠️ | 53 | 34 | 42 | 59 | 53 | 13 | 42 | 0.02 € | 0.89 rps |
89. Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 22 | 47 | 59 | 38 | 62 | 8 | 39 | 0.05 € | 2.39 rps |
90. Gemma 2 9B IT ⚠️ | 45 | 25 | 47 | 34 | 68 | 13 | 38 | 0.02 € | 0.88 rps |
91. Meta Llama2 13B chat f16🦙 | 22 | 38 | 17 | 60 | 75 | 6 | 36 | 0.75 € | 1.44 rps |
92. Mistral 7B Zephyr-β f16 ✅ | 37 | 34 | 46 | 59 | 29 | 4 | 35 | 0.46 € | 2.34 rps |
93. Meta Llama2 7B chat f16🦙 | 22 | 33 | 20 | 60 | 50 | 18 | 34 | 0.56 € | 1.93 rps |
94. Mistral 7B Notus-v1 f16 ⚠️ | 10 | 54 | 25 | 52 | 48 | 4 | 32 | 0.75 € | 1.43 rps |
95. Orca 2 13B f16 ⚠️ | 18 | 22 | 32 | 22 | 67 | 20 | 30 | 0.95 € | 1.14 rps |
96. Mistral 7B v0.1 f16 ☁️ | 0 | 9 | 48 | 53 | 52 | 12 | 29 | 0.87 € | 1.23 rps |
97. Mistral 7B Instruct v0.2 f16 ☁️ | 11 | 30 | 54 | 12 | 58 | 8 | 29 | 0.96 € | 1.12 rps |
98. Google Gemma 2B IT f16 ⚠️ | 33 | 28 | 16 | 57 | 15 | 20 | 28 | 0.30 € | 3.54 rps |
99. Microsoft Phi 3 Medium 4K Instruct 14B f16 ⚠️ | 5 | 34 | 30 | 11 | 47 | 8 | 22 | 0.82 € | 1.32 rps |
100. Orca 2 7B f16 ⚠️ | 22 | 0 | 26 | 20 | 52 | 4 | 21 | 0.78 € | 1.38 rps |
101. Google Gemma 7B IT f16 ⚠️ | 0 | 0 | 0 | 9 | 62 | 0 | 12 | 0.99 € | 1.08 rps |
102. Meta Llama2 7B f16🦙 | 0 | 5 | 22 | 3 | 28 | 2 | 10 | 0.95 € | 1.13 rps |
103. Yi 1.5 9B Chat f16 ⚠️ | 0 | 4 | 29 | 9 | 0 | 8 | 8 | 1.41 € | 0.76 rps |
Grok 2 Beta from X-AI
This wasn’t expected, but the second version of Grok from X-AI suddenly started making sense (the previous one was nearly useless). Grok 2 Beta made its way into the TOP 15. The model performed quite well overall on tasks extracted from LLM products in our benchmark, and even its Reason score is respectable.
The model scores close to older versions of GPT-4, but it is still behind Qwen 2.5 72B Instruct, which you can download and run on your own hardware. Nonetheless, the news is great: pretty much any company can make it into the TOP 20 of our benchmark if it has enough diverse data and access to compute capacity for training.
Gemini 1.5 Flash 8B - Future Perfect
In the LLM Benchmark for September we talked about the new models in the Llama 3.2 series, which really pushed the state of the art for local models. The progress doesn’t stop there.
Google has released a new Gemini 1.5 Flash 8B model, and it shows nice results on our product benchmark: this 8B model performs at the level of GPT-3.5 or Llama 3 70B, almost catching up with the regular Gemini 1.5 Flash.
The model also illustrates Google’s progress in LLM development: it is the cheapest model in our table, yet it ranks quite high, even compared to the Gemini Pro LLMs released in previous months.
The biggest limitation of this model: it is closed. Even though we know its size (8B), it isn’t possible to download the weights and run it locally.
However, Gemini 1.5 Flash 8B can be used quite cheaply. Plus, as history tells us, whatever one company has achieved, another company can soon repeat. So we’ll be waiting for more small models of this quality, preferably ones that can run locally.
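For reference, here is a minimal sketch of calling the model through the google-generativeai Python SDK; `gemini-1.5-flash-8b` is the model id Google lists, and the prompt is just a placeholder.

```python
# Minimal sketch: calling Gemini 1.5 Flash 8B via the google-generativeai SDK.
# Assumes `pip install google-generativeai` and an API key in GEMINI_API_KEY.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-8b")

response = model.generate_content(
    "Summarize this support ticket in one sentence: ..."  # placeholder prompt
)
print(response.text)
```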
Claude Sonnet 3.5 and Haiku 3.5 - Getting better
Anthropic released updates to the two models in its lineup:
Medium: Sonnet 3.5
Small: Haiku 3.5
Sonnet 3.5 is currently the highest scoring model from Anthropic in our benchmark. It jumped to 11th place.
Compared to the previous version of Sonnet 3.5, this version shows improved instruction-following and enhanced coding capabilities, both in writing code and in handling more complex engineering tasks.
Claude 3.5 Sonnet v2 is overall a decent model, but you can get better quality at a lower price, for example by using GPT-4o or running a local Qwen 2.5.
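To make the local alternative concrete, here is a minimal sketch of running Qwen 2.5 72B Instruct with Hugging Face transformers. `Qwen/Qwen2.5-72B-Instruct` is the public checkpoint id; at f16 the 72B model needs several large GPUs, so treat this as a sketch rather than a sizing guide.

```python
# Minimal sketch: running Qwen 2.5 72B Instruct locally with transformers.
# Assumes `pip install transformers accelerate torch` and enough GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"  # public Hugging Face checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # shard across available GPUs
)

messages = [{"role": "user", "content": "Draft a short product description for ..."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))
```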
Claude 3.5 Haiku is another improvement in the Haiku series. The model has improved scores across the board (except in the Code+Engineering category). The biggest jump was in Reason: from 35 to 68! This is the highest Reason score of all Anthropic models. Could this point towards a new architecture in the next Claude series?
Additional facts support this theory: the Haiku model was the last one to come out, and it costs 4x more than the previous Haiku version. Cost structure changes in LLMs are usually aligned with underlying architectural changes.
Because of the price hike, Haiku is no longer in the “smart and extremely cheap” category. At this price point you can find better models, like GPT-4o mini or Google Gemini 1.5 Flash 8B.
The overall trend of quality improvements within model ranges continues. Let’s see if the improved Reason score shows up in other model releases from Anthropic.
Trends
Speaking of trends, take a look at this interesting meta-trend: within the last two months, OpenAI, Google and Anthropic have introduced new lower-tier models into higher pricing tiers. This makes the charts look as if LLM performance is actually degrading within these tiers.
This could potentially mean a combination of three things:
LLM providers are starting to optimise their price offerings based on cost and usage.
It is no longer possible to compete on quality without increasing compute resources (could we be hitting the limits of the transformer architecture?).
Our price brackets for categories were not chosen well. We’ll need to redo the entire chart.
And if we plot Gemini 1.5 Flash 8B on the map of locally-capable models, the picture looks like the one below, marking a nice performance jump in the state of the art.
Let’s see how things continue into November 2024. We will keep you updated!
Transform Your Digital Projects with the Best AI Language Models!
Discover the transformative power of the best LLMs and revolutionize your digital products with AI! Stay future-focused, boost efficiency, and gain a clear competitive edge. We help you elevate your business value to the next level.
Martin Warnung