The best Large Language Models of October 2024

The TIMETOACT GROUP LLM Benchmarks highlight the most powerful AI language models for digital product development. Discover which large language models performed best in October 2024.

We have a few new models to talk about, so let’s get started:

  • Grok 2 from xAI - Suddenly in the TOP 15

  • Gemini 1.5 Flash 8B - Future Perfect

  • New Claude Sonnet 3.5 and Haiku 3.5 - Getting better

LLM Benchmarks | October 2024

The benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license

Code

Can the model generate code and help with programming?

Cost

The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on the GPU requirements of each model, GPU rental cost, model speed, and operational overhead (see the sketch after this legend).

CRM

How well does the model support work with product catalogs and marketplaces?

Docs

How well can the model work with large documents and knowledge bases?

Integrate

Can the model easily interact with external APIs, services and plugins?

Marketing

How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

Reason

How well can the model reason and draw conclusions in a given context?

Speed

The "Speed" column indicates the estimated speed of the model in requests per second (without batching). The higher the speed, the better.
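To make the on-premises cost estimate concrete, here is a minimal sketch of the calculation. All numbers and the 20% overhead factor are illustrative placeholders, not our actual benchmark inputs.

```python
# Minimal sketch of the on-premises cost estimate described under "Cost".
# All inputs are illustrative placeholders, not our actual benchmark numbers.

def on_prem_cost_per_request(
    gpu_rental_eur_per_hour: float,  # rental price of the GPUs the model needs
    requests_per_second: float,      # measured model speed without batching
    overhead_factor: float = 1.2,    # assumed 20% operational overhead
) -> float:
    """Estimated cost in EUR of a single request for a locally hosted model."""
    requests_per_hour = requests_per_second * 3600
    return gpu_rental_eur_per_hour * overhead_factor / requests_per_hour

# Example: a 2 €/hour GPU setup serving 0.9 requests per second.
print(f"{on_prem_cost_per_request(2.0, 0.9):.5f} € per request")
```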

| Model | Code | CRM | Docs | Integrate | Marketing | Reason | Final | Cost | Speed |
|-------|------|-----|------|-----------|-----------|--------|-------|------|-------|
| 1. GPT o1-preview v1/2024-09-12 ☁️ | 95 | 92 | 94 | 96 | 88 | 87 | 92 | 52.32 € | 0.08 rps |
| 2. GPT o1-mini v1/2024-09-12 ☁️ | 93 | 96 | 94 | 85 | 82 | 87 | 90 | 8.15 € | 0.16 rps |
| 3. Google Gemini 1.5 Pro v2 ☁️ | 86 | 97 | 94 | 100 | 78 | 74 | 88 | 1.00 € | 1.18 rps |
| 4. GPT-4o v1/2024-05-13 ☁️ | 90 | 96 | 100 | 89 | 78 | 74 | 88 | 1.21 € | 1.44 rps |
| 5. GPT-4o v3/dyn-2024-08-13 ☁️ | 90 | 97 | 100 | 81 | 79 | 78 | 88 | 1.22 € | 1.21 rps |
| 6. GPT-4 Turbo v5/2024-04-09 ☁️ | 86 | 99 | 98 | 100 | 88 | 43 | 86 | 2.45 € | 0.84 rps |
| 7. GPT-4o v2/2024-08-06 ☁️ | 90 | 84 | 97 | 92 | 82 | 59 | 84 | 0.63 € | 1.49 rps |
| 8. Google Gemini 1.5 Pro 0801 ☁️ | 84 | 92 | 79 | 100 | 70 | 74 | 83 | 0.90 € | 0.83 rps |
| 9. Qwen 2.5 72B Instruct ⚠️ | 79 | 92 | 94 | 100 | 71 | 59 | 83 | 0.10 € | 0.66 rps |
| 10. Llama 3.1 405B Hermes 3 🦙 | 68 | 93 | 89 | 100 | 88 | 53 | 82 | 0.54 € | 0.49 rps |
| 11. Claude 3.5 Sonnet v2 ☁️ | 82 | 97 | 93 | 85 | 71 | 57 | 81 | 0.95 € | 0.09 rps |
| 12. GPT-4 v1/0314 ☁️ | 90 | 88 | 98 | 70 | 88 | 45 | 80 | 7.04 € | 1.31 rps |
| 13. X-AI Grok 2 ⚠️ | 63 | 93 | 87 | 89 | 88 | 58 | 79 | 1.03 € | 0.31 rps |
| 14. GPT-4 v2/0613 ☁️ | 90 | 83 | 95 | 70 | 88 | 45 | 78 | 7.04 € | 2.16 rps |
| 15. Claude 3 Opus ☁️ | 69 | 88 | 100 | 78 | 76 | 58 | 78 | 4.69 € | 0.41 rps |
| 16. Claude 3.5 Sonnet v1 ☁️ | 72 | 83 | 89 | 85 | 80 | 58 | 78 | 0.94 € | 0.09 rps |
| 17. GPT-4 Turbo v4/0125-preview ☁️ | 66 | 97 | 100 | 85 | 75 | 43 | 78 | 2.45 € | 0.84 rps |
| 18. GPT-4o Mini ☁️ | 63 | 87 | 80 | 70 | 100 | 65 | 78 | 0.04 € | 1.46 rps |
| 19. Meta Llama3.1 405B Instruct 🦙 | 81 | 93 | 92 | 70 | 75 | 48 | 76 | 2.39 € | 1.16 rps |
| 20. GPT-4 Turbo v3/1106-preview ☁️ | 66 | 75 | 98 | 70 | 88 | 60 | 76 | 2.46 € | 0.68 rps |
| 21. DeepSeek v2.5 236B ⚠️ | 57 | 80 | 91 | 78 | 88 | 57 | 75 | 0.03 € | 0.42 rps |
| 22. Google Gemini 1.5 Flash v2 ☁️ | 64 | 96 | 89 | 75 | 81 | 44 | 75 | 0.06 € | 2.01 rps |
| 23. Google Gemini 1.5 Pro 0409 ☁️ | 68 | 97 | 96 | 85 | 75 | 26 | 74 | 0.95 € | 0.59 rps |
| 24. Meta Llama 3.1 70B Instruct f16 🦙 | 74 | 89 | 90 | 70 | 75 | 48 | 74 | 1.79 € | 0.90 rps |
| 25. Google Gemini Flash 1.5 8B ☁️ | 70 | 93 | 78 | 69 | 76 | 48 | 72 | 0.01 € | 1.19 rps |
| 26. GPT-3.5 v2/0613 ☁️ | 68 | 81 | 73 | 81 | 81 | 50 | 72 | 0.34 € | 1.46 rps |
| 27. Meta Llama 3 70B Instruct 🦙 | 81 | 83 | 84 | 60 | 81 | 45 | 72 | 0.06 € | 0.85 rps |
| 28. Mistral Large 123B v2/2407 ☁️ | 68 | 79 | 68 | 75 | 75 | 70 | 72 | 0.86 € | 1.02 rps |
| 29. Google Gemini 1.5 Pro 0514 ☁️ | 73 | 96 | 79 | 100 | 25 | 60 | 72 | 1.07 € | 0.92 rps |
| 30. Google Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 75 | 72 | 52 | 71 | 0.06 € | 1.77 rps |
| 31. Google Gemini 1.0 Pro ☁️ | 66 | 86 | 83 | 78 | 88 | 28 | 71 | 0.37 € | 1.36 rps |
| 32. Meta Llama 3.2 90B Vision 🦙 | 74 | 84 | 87 | 78 | 71 | 32 | 71 | 0.23 € | 1.10 rps |
| 33. GPT-3.5 v3/1106 ☁️ | 68 | 70 | 71 | 78 | 78 | 58 | 70 | 0.24 € | 2.33 rps |
| 34. GPT-3.5 v4/0125 ☁️ | 63 | 87 | 71 | 78 | 78 | 43 | 70 | 0.12 € | 1.43 rps |
| 35. Claude 3.5 Haiku ☁️ | 52 | 80 | 72 | 70 | 75 | 68 | 70 | 0.32 € | 1.24 rps |
| 36. Qwen1.5 32B Chat f16 ⚠️ | 70 | 90 | 82 | 78 | 78 | 20 | 69 | 0.97 € | 1.66 rps |
| 37. Cohere Command R+ ☁️ | 63 | 80 | 76 | 70 | 70 | 58 | 69 | 0.83 € | 1.90 rps |
| 38. Gemma 2 27B IT ⚠️ | 61 | 72 | 87 | 70 | 89 | 32 | 69 | 0.07 € | 0.90 rps |
| 39. Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 68 | 87 | 67 | 70 | 88 | 25 | 67 | 0.32 € | 3.39 rps |
| 40. Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ | 63 | 67 | 84 | 60 | 81 | 46 | 67 | 0.21 € | 5.09 rps |
| 41. Meta Llama 3 8B Instruct f16 🦙 | 79 | 62 | 68 | 70 | 80 | 41 | 67 | 0.32 € | 3.33 rps |
| 42. Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 63 | 73 | 72 | 69 | 88 | 30 | 66 | 0.32 € | 3.40 rps |
| 43. Mistral 7B OpenChat-3.5 v1 f16 ✅ | 58 | 72 | 72 | 70 | 88 | 33 | 65 | 0.49 € | 2.20 rps |
| 44. GPT-3.5-instruct 0914 ☁️ | 47 | 92 | 69 | 62 | 88 | 33 | 65 | 0.35 € | 2.15 rps |
| 45. GPT-3.5 v1/0301 ☁️ | 55 | 82 | 69 | 78 | 82 | 26 | 65 | 0.35 € | 4.12 rps |
| 46. Llama 3 8B OpenChat-3.6 20240522 f16 ✅ | 76 | 51 | 76 | 60 | 88 | 38 | 65 | 0.28 € | 3.79 rps |
| 47. Mistral Nemo 12B v1/2407 ☁️ | 54 | 58 | 51 | 100 | 75 | 49 | 64 | 0.03 € | 1.22 rps |
| 48. Meta Llama 3.2 11B Vision 🦙 | 70 | 71 | 65 | 70 | 71 | 36 | 64 | 0.04 € | 1.49 rps |
| 49. Starling 7B-alpha f16 ⚠️ | 58 | 66 | 67 | 70 | 88 | 34 | 64 | 0.58 € | 1.85 rps |
| 50. Qwen 2.5 7B Instruct ⚠️ | 48 | 77 | 80 | 60 | 69 | 47 | 63 | 0.07 € | 1.25 rps |
| 51. Llama 3 8B Hermes 2 Theta 🦙 | 61 | 73 | 74 | 70 | 85 | 16 | 63 | 0.05 € | 0.55 rps |
| 52. Yi 1.5 34B Chat f16 ⚠️ | 47 | 78 | 70 | 70 | 86 | 26 | 63 | 1.18 € | 1.37 rps |
| 53. Claude 3 Haiku ☁️ | 64 | 69 | 64 | 70 | 75 | 35 | 63 | 0.08 € | 0.52 rps |
| 54. Liquid: LFM 40B MoE ⚠️ | 72 | 69 | 65 | 60 | 82 | 24 | 62 | 0.00 € | 1.45 rps |
| 55. Meta Llama 3.1 8B Instruct f16 🦙 | 57 | 74 | 62 | 70 | 74 | 32 | 61 | 0.45 € | 2.41 rps |
| 56. Qwen2 7B Instruct f32 ⚠️ | 50 | 81 | 81 | 60 | 66 | 31 | 61 | 0.46 € | 2.36 rps |
| 57. Mistral Small v3/2409 ☁️ | 43 | 75 | 71 | 75 | 75 | 26 | 61 | 0.06 € | 0.81 rps |
| 58. Claude 3 Sonnet ☁️ | 72 | 41 | 74 | 70 | 78 | 28 | 61 | 0.95 € | 0.85 rps |
| 59. Mixtral 8x22B API (Instruct) ☁️ | 53 | 62 | 62 | 100 | 75 | 7 | 60 | 0.17 € | 3.12 rps |
| 60. Mistral Pixtral 12B ✅ | 53 | 69 | 73 | 60 | 64 | 40 | 60 | 0.03 € | 0.83 rps |
| 61. Codestral Mamba 7B v1 ✅ | 53 | 66 | 51 | 100 | 71 | 17 | 60 | 0.30 € | 2.82 rps |
| 62. Inflection 3 Productivity ⚠️ | 46 | 59 | 39 | 70 | 79 | 61 | 59 | 0.92 € | 0.17 rps |
| 63. Anthropic Claude Instant v1.2 ☁️ | 58 | 75 | 65 | 75 | 65 | 16 | 59 | 2.10 € | 1.49 rps |
| 64. Cohere Command R ☁️ | 45 | 66 | 57 | 70 | 84 | 27 | 58 | 0.13 € | 2.50 rps |
| 65. Anthropic Claude v2.0 ☁️ | 63 | 52 | 55 | 60 | 84 | 34 | 58 | 2.19 € | 0.40 rps |
| 66. Qwen1.5 7B Chat f16 ⚠️ | 56 | 81 | 60 | 50 | 60 | 36 | 57 | 0.29 € | 3.76 rps |
| 67. Mistral Large v1/2402 ☁️ | 37 | 49 | 70 | 78 | 84 | 25 | 57 | 0.58 € | 2.11 rps |
| 68. Microsoft WizardLM 2 8x22B ⚠️ | 48 | 76 | 79 | 50 | 62 | 22 | 56 | 0.13 € | 0.70 rps |
| 69. Qwen1.5 14B Chat f16 ⚠️ | 50 | 58 | 51 | 70 | 84 | 22 | 56 | 0.36 € | 3.03 rps |
| 70. MistralAI Ministral 8B ✅ | 56 | 55 | 41 | 85 | 68 | 30 | 56 | 0.02 € | 1.02 rps |
| 71. MistralAI Ministral 3B ✅ | 50 | 48 | 39 | 92 | 60 | 41 | 55 | 0.01 € | 1.02 rps |
| 72. Anthropic Claude v2.1 ☁️ | 29 | 58 | 59 | 78 | 75 | 32 | 55 | 2.25 € | 0.35 rps |
| 73. Llama2 13B Vicuna-1.5 f16 🦙 | 50 | 37 | 55 | 60 | 82 | 37 | 53 | 0.99 € | 1.09 rps |
| 74. Mistral 7B Instruct v0.1 f16 ☁️ | 34 | 71 | 69 | 59 | 62 | 23 | 53 | 0.75 € | 1.43 rps |
| 75. Mistral 7B OpenOrca f16 ☁️ | 54 | 57 | 76 | 25 | 78 | 27 | 53 | 0.41 € | 2.65 rps |
| 76. Meta Llama 3.2 3B 🦙 | 52 | 71 | 66 | 70 | 44 | 14 | 53 | 0.01 € | 1.25 rps |
| 77. Google Recurrent Gemma 9B IT f16 ⚠️ | 58 | 27 | 71 | 60 | 56 | 23 | 49 | 0.89 € | 1.21 rps |
| 78. Codestral 22B v1 ✅ | 38 | 47 | 44 | 78 | 66 | 13 | 48 | 0.06 € | 4.03 rps |
| 79. Llama2 13B Hermes f16 🦙 | 50 | 24 | 37 | 74 | 60 | 42 | 48 | 1.00 € | 1.07 rps |
| 80. IBM Granite 34B Code Instruct f16 ☁️ | 63 | 49 | 34 | 70 | 57 | 7 | 47 | 1.07 € | 1.51 rps |
| 81. Mistral Small v2/2402 ☁️ | 33 | 42 | 45 | 92 | 56 | 8 | 46 | 0.06 € | 3.21 rps |
| 82. DBRX 132B Instruct ⚠️ | 43 | 39 | 43 | 77 | 59 | 10 | 45 | 0.26 € | 1.31 rps |
| 83. NVIDIA Llama 3.1 Nemotron 70B Instruct 🦙 | 68 | 54 | 25 | 74 | 28 | 21 | 45 | 0.09 € | 0.53 rps |
| 84. Mistral Medium v1/2312 ☁️ | 41 | 43 | 44 | 61 | 62 | 12 | 44 | 0.81 € | 0.35 rps |
| 85. Meta Llama 3.2 1B 🦙 | 32 | 40 | 33 | 40 | 68 | 51 | 44 | 0.02 € | 1.69 rps |
| 86. Llama2 13B Puffin f16 🦙 | 37 | 15 | 44 | 70 | 56 | 39 | 43 | 4.70 € | 0.23 rps |
| 87. Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 67 | 63 | 52 | 56 | 8 | 43 | 0.06 € | 2.21 rps |
| 88. Microsoft WizardLM 2 7B ⚠️ | 53 | 34 | 42 | 59 | 53 | 13 | 42 | 0.02 € | 0.89 rps |
| 89. Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 22 | 47 | 59 | 38 | 62 | 8 | 39 | 0.05 € | 2.39 rps |
| 90. Gemma 2 9B IT ⚠️ | 45 | 25 | 47 | 34 | 68 | 13 | 38 | 0.02 € | 0.88 rps |
| 91. Meta Llama2 13B chat f16 🦙 | 22 | 38 | 17 | 60 | 75 | 6 | 36 | 0.75 € | 1.44 rps |
| 92. Mistral 7B Zephyr-β f16 ✅ | 37 | 34 | 46 | 59 | 29 | 4 | 35 | 0.46 € | 2.34 rps |
| 93. Meta Llama2 7B chat f16 🦙 | 22 | 33 | 20 | 60 | 50 | 18 | 34 | 0.56 € | 1.93 rps |
| 94. Mistral 7B Notus-v1 f16 ⚠️ | 10 | 54 | 25 | 52 | 48 | 4 | 32 | 0.75 € | 1.43 rps |
| 95. Orca 2 13B f16 ⚠️ | 18 | 22 | 32 | 22 | 67 | 20 | 30 | 0.95 € | 1.14 rps |
| 96. Mistral 7B v0.1 f16 ☁️ | 0 | 9 | 48 | 53 | 52 | 12 | 29 | 0.87 € | 1.23 rps |
| 97. Mistral 7B Instruct v0.2 f16 ☁️ | 11 | 30 | 54 | 12 | 58 | 8 | 29 | 0.96 € | 1.12 rps |
| 98. Google Gemma 2B IT f16 ⚠️ | 33 | 28 | 16 | 57 | 15 | 20 | 28 | 0.30 € | 3.54 rps |
| 99. Microsoft Phi 3 Medium 4K Instruct 14B f16 ⚠️ | 5 | 34 | 30 | 11 | 47 | 8 | 22 | 0.82 € | 1.32 rps |
| 100. Orca 2 7B f16 ⚠️ | 22 | 0 | 26 | 20 | 52 | 4 | 21 | 0.78 € | 1.38 rps |
| 101. Google Gemma 7B IT f16 ⚠️ | 0 | 0 | 0 | 9 | 62 | 0 | 12 | 0.99 € | 1.08 rps |
| 102. Meta Llama2 7B f16 🦙 | 0 | 5 | 22 | 3 | 28 | 2 | 10 | 0.95 € | 1.13 rps |
| 103. Yi 1.5 9B Chat f16 ⚠️ | 0 | 4 | 29 | 9 | 0 | 8 | 8 | 1.41 € | 0.76 rps |

Grok 2 Beta from xAI

This wasn’t expected, but the second version of Grok from xAI suddenly started making sense (the previous one was nearly useless). Grok 2 Beta made its way into the TOP 15. The model performed quite well overall on the tasks extracted from LLM products in our benchmark, and even its Reason score is respectable.

The model scores close to older versions of GPT-4, but it is still behind Qwen 2.5 72B Instruct, which you can download and run on your own hardware (see the sketch below). Nonetheless, the news is great: pretty much any company can make it into the TOP 20 of our benchmark, given enough diverse data and access to compute for training.
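For illustration, here is a minimal sketch of running Qwen 2.5 72B Instruct locally with Hugging Face transformers. The model id is the public Hugging Face repository, the prompt is a made-up example, and the full 72B model needs several high-end GPUs (quantized builds lower the bar).

```python
# Minimal sketch: running Qwen 2.5 72B Instruct locally via Hugging Face
# transformers. Assumes enough GPU memory for the full 72B model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

# Example prompt (placeholder): a typical product-development task.
messages = [{"role": "user", "content": "Draft a one-sentence product description for a coffee grinder."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```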

Gemini 1.5 Flash 8B - Future Perfect

In the LLM Benchmark for September, we talked about the new models in the Llama 3.2 series, which pushed the state of the art for local models at the time. The progress doesn’t stop there.

Google has released a new Gemini 1.5 Flash 8B model that shows good results on our product benchmark. This 8B model performs on the level of GPT-3.5 or Llama 3 70B, almost catching up with the regular 1.5 Flash.

The model also illustrates Google’s progress in LLM development: it is the cheapest model in the table, yet it ranks high even compared to the Gemini Pro LLMs released in previous months.

The biggest limitation of this model is that it is closed. Even though we know its size (8B), it isn’t possible to download the weights and run it locally.

However, Gemini 1.5 Flash 8B can be used quite cheaply via the API. Plus, as history tells us, whatever one company has achieved, another company can soon repeat. So we’ll be waiting for more small models of this quality, preferably locally-capable ones.
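As a pointer, here is a minimal sketch of calling it through Google’s google-generativeai Python SDK. The model id "gemini-1.5-flash-8b" is current as of the time of writing, and the prompt is for illustration only.

```python
# Minimal sketch: calling Gemini 1.5 Flash 8B via the google-generativeai SDK.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash-8b")

# Example task (placeholder): a cheap, high-volume classification call.
response = model.generate_content(
    "Classify this support ticket as bug, billing or other: 'I was charged twice.'"
)
print(response.text)
```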

Claude Sonnet 3.5 and Haiku 3.5 - Getting better

Anthropic released updates to two models in its lineup:

  • Medium: Sonnet 3.5

  • Small: Haiku 3.5

Sonnet 3.5 is currently the highest-scoring model from Anthropic in our benchmark. It jumped to 11th place.

Compared to the previous version of Sonnet 3.5, this version shows improved instruction-following and enhanced coding capabilities, both in writing code and in handling more complex engineering tasks.

Claude 3.5 Sonnet v2 is a decent model overall, but you can get better quality at a lower price, for example by using GPT-4o or running Qwen 2.5 locally.
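For reference, a minimal sketch of calling the updated Sonnet through the Anthropic Python SDK. The dated model id "claude-3-5-sonnet-20241022" refers to this v2 release (the new Haiku follows the same scheme); the prompt is a made-up example.

```python
# Minimal sketch: calling Claude 3.5 Sonnet v2 via the Anthropic Python SDK.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # Haiku 3.5: "claude-3-5-haiku-20241022"
    max_tokens=512,
    messages=[{"role": "user", "content": "Review this Python function for bugs: def add(a, b): return a - b"}],
)
print(message.content[0].text)
```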

Claude 3.5 Haiku is another improvement in the Haiku series. The model has improved scores across the board (except in the Code+Engineering category). The biggest jump was in Reason: from 35 to 68! This is the highest Reason score among all Anthropic models. Could this point towards a new architecture in the next Claude series?

Additional facts support this theory: the Haiku model was the last one to come out, and it costs four times more than the previous Haiku version. Cost-structure changes in LLMs are usually aligned with underlying architectural changes.

Because of the price hike, Haiku is no longer in the “smart and extremely cheap” category. At this price point you can find better models, such as GPT-4o Mini or Google Gemini 1.5 Flash 8B.

The overall trend of quality increasing within model ranges continues. Let’s see if the improved Reason score shows up in other model releases from Anthropic.

Trends

Speaking of trends, take a look at this interesting meta-trend: within the last two months, OpenAI, Google, and Anthropic have introduced new lower-tier models into higher pricing tiers. This makes the charts look as if LLM performance were actually degrading within these tiers.

This could potentially mean a combination of three things:

  • LLM providers are starting to optimise their price offerings based on cost and usage.

  • It is no longer possible to compete on quality without increasing compute resources (could we be hitting the limits of the transformer architecture?).

  • Our price brackets for categories were not chosen well. We’ll need to redo the entire chart.

And if we plot Gemini 1.5 Flash 8B on the map of locally-capable models, the picture looks like the one below, marking a nice jump in the state of the art.

Let’s see how things continue into November 2024. We will keep you updated!

Transform Your Digital Projects with the Best AI Language Models!

Discover the transformative power of the best LLMs and revolutionize your digital products with AI! Stay future-focused, boost efficiency, and gain a clear competitive edge. We help you elevate your business value to the next level.

 


As the logistics industry becomes increasingly complex, businesses need innovative solutions to manage the challenges of supply chain management, trucking, and delivery. With competitors investing in cutting-edge research and development, it is vital for companies to stay ahead of the curve and embrace the latest technologies to remain competitive. That is why we introduce the TIMETOACT Logistics Simulator Framework, a revolutionary tool for creating a digital twin of your logistics operation.