We have a few new models to talk about, so let’s get started:
Grok 2 from X-AI - Suddenly in the TOP 15
Gemini 1.5 Flash 8B - Future Perfect
New Claude Sonnet 3.5 and Haiku 3.5 - Getting better
LLM Benchmarks | October 2024
The benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.
☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license
⚠️ - Models with custom or restricted licenses
**Code**: Can the model generate code and help with programming?
**CRM**: How well does the model support work with product catalogs and marketplaces?
**Docs**: How well can the model work with large documents and knowledge bases?
**Integrate**: Can the model easily interact with external APIs, services and plugins?
**Marketing**: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?
**Reason**: How well can the model reason and draw conclusions in a given context?
**Final**: The average of the category scores above.
**Cost**: The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead (see the sketch below these descriptions).
**Speed**: The estimated speed of the model in requests per second (without batching). The higher the speed, the better.
Model | Code | CRM | Docs | Integrate | Marketing | Reason | Final | Cost | Speed |
---|---|---|---|---|---|---|---|---|---|
1. GPT o1-preview v1/2024-09-12 ☁️ | 95 | 92 | 94 | 96 | 88 | 87 | 92 | 52.32 € | 0.08 rps |
2. GPT o1-mini v1/2024-09-12 ☁️ | 93 | 96 | 94 | 85 | 82 | 87 | 90 | 8.15 € | 0.16 rps |
3. Google Gemini 1.5 Pro v2 ☁️ | 86 | 97 | 94 | 100 | 78 | 74 | 88 | 1.00 € | 1.18 rps |
4. GPT-4o v1/2024-05-13 ☁️ | 90 | 96 | 100 | 89 | 78 | 74 | 88 | 1.21 € | 1.44 rps |
5. GPT-4o v3/dyn-2024-08-13 ☁️ | 90 | 97 | 100 | 81 | 79 | 78 | 88 | 1.22 € | 1.21 rps |
6. GPT-4 Turbo v5/2024-04-09 ☁️ | 86 | 99 | 98 | 100 | 88 | 43 | 86 | 2.45 € | 0.84 rps |
7. GPT-4o v2/2024-08-06 ☁️ | 90 | 84 | 97 | 92 | 82 | 59 | 84 | 0.63 € | 1.49 rps |
8. Google Gemini 1.5 Pro 0801 ☁️ | 84 | 92 | 79 | 100 | 70 | 74 | 83 | 0.90 € | 0.83 rps |
9. Qwen 2.5 72B Instruct ⚠️ | 79 | 92 | 94 | 100 | 71 | 59 | 83 | 0.10 € | 0.66 rps |
10. Llama 3.1 405B Hermes 3🦙 | 68 | 93 | 89 | 100 | 88 | 53 | 82 | 0.54 € | 0.49 rps |
11. Claude 3.5 Sonnet v2 ☁️ | 82 | 97 | 93 | 85 | 71 | 57 | 81 | 0.95 € | 0.09 rps |
12. GPT-4 v1/0314 ☁️ | 90 | 88 | 98 | 70 | 88 | 45 | 80 | 7.04 € | 1.31 rps |
13. X-AI Grok 2 ⚠️ | 63 | 93 | 87 | 89 | 88 | 58 | 79 | 1.03 € | 0.31 rps |
14. GPT-4 v2/0613 ☁️ | 90 | 83 | 95 | 70 | 88 | 45 | 78 | 7.04 € | 2.16 rps |
15. Claude 3 Opus ☁️ | 69 | 88 | 100 | 78 | 76 | 58 | 78 | 4.69 € | 0.41 rps |
16. Claude 3.5 Sonnet v1 ☁️ | 72 | 83 | 89 | 85 | 80 | 58 | 78 | 0.94 € | 0.09 rps |
17. GPT-4 Turbo v4/0125-preview ☁️ | 66 | 97 | 100 | 85 | 75 | 43 | 78 | 2.45 € | 0.84 rps |
18. GPT-4o Mini ☁️ | 63 | 87 | 80 | 70 | 100 | 65 | 78 | 0.04 € | 1.46 rps |
19. Meta Llama3.1 405B Instruct🦙 | 81 | 93 | 92 | 70 | 75 | 48 | 76 | 2.39 € | 1.16 rps |
20. GPT-4 Turbo v3/1106-preview ☁️ | 66 | 75 | 98 | 70 | 88 | 60 | 76 | 2.46 € | 0.68 rps |
21. DeepSeek v2.5 236B ⚠️ | 57 | 80 | 91 | 78 | 88 | 57 | 75 | 0.03 € | 0.42 rps |
22. Google Gemini 1.5 Flash v2 ☁️ | 64 | 96 | 89 | 75 | 81 | 44 | 75 | 0.06 € | 2.01 rps |
23. Google Gemini 1.5 Pro 0409 ☁️ | 68 | 97 | 96 | 85 | 75 | 26 | 74 | 0.95 € | 0.59 rps |
24. Meta Llama 3.1 70B Instruct f16🦙 | 74 | 89 | 90 | 70 | 75 | 48 | 74 | 1.79 € | 0.90 rps |
25. Google Gemini Flash 1.5 8B ☁️ | 70 | 93 | 78 | 69 | 76 | 48 | 72 | 0.01 € | 1.19 rps |
26. GPT-3.5 v2/0613 ☁️ | 68 | 81 | 73 | 81 | 81 | 50 | 72 | 0.34 € | 1.46 rps |
27. Meta Llama 3 70B Instruct🦙 | 81 | 83 | 84 | 60 | 81 | 45 | 72 | 0.06 € | 0.85 rps |
28. Mistral Large 123B v2/2407 ☁️ | 68 | 79 | 68 | 75 | 75 | 70 | 72 | 0.86 € | 1.02 rps |
29. Google Gemini 1.5 Pro 0514 ☁️ | 73 | 96 | 79 | 100 | 25 | 60 | 72 | 1.07 € | 0.92 rps |
30. Google Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 75 | 72 | 52 | 71 | 0.06 € | 1.77 rps |
31. Google Gemini 1.0 Pro ☁️ | 66 | 86 | 83 | 78 | 88 | 28 | 71 | 0.37 € | 1.36 rps |
32. Meta Llama 3.2 90B Vision🦙 | 74 | 84 | 87 | 78 | 71 | 32 | 71 | 0.23 € | 1.10 rps |
33. GPT-3.5 v3/1106 ☁️ | 68 | 70 | 71 | 78 | 78 | 58 | 70 | 0.24 € | 2.33 rps |
34. GPT-3.5 v4/0125 ☁️ | 63 | 87 | 71 | 78 | 78 | 43 | 70 | 0.12 € | 1.43 rps |
35. Claude 3.5 Haiku ☁️ | 52 | 80 | 72 | 70 | 75 | 68 | 70 | 0.32 € | 1.24 rps |
36. Qwen1.5 32B Chat f16 ⚠️ | 70 | 90 | 82 | 78 | 78 | 20 | 69 | 0.97 € | 1.66 rps |
37. Cohere Command R+ ☁️ | 63 | 80 | 76 | 70 | 70 | 58 | 69 | 0.83 € | 1.90 rps |
38. Gemma 2 27B IT ⚠️ | 61 | 72 | 87 | 70 | 89 | 32 | 69 | 0.07 € | 0.90 rps |
39. Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 68 | 87 | 67 | 70 | 88 | 25 | 67 | 0.32 € | 3.39 rps |
40. Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ | 63 | 67 | 84 | 60 | 81 | 46 | 67 | 0.21 € | 5.09 rps |
41. Meta Llama 3 8B Instruct f16🦙 | 79 | 62 | 68 | 70 | 80 | 41 | 67 | 0.32 € | 3.33 rps |
42. Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 63 | 73 | 72 | 69 | 88 | 30 | 66 | 0.32 € | 3.40 rps |
43. Mistral 7B OpenChat-3.5 v1 f16 ✅ | 58 | 72 | 72 | 70 | 88 | 33 | 65 | 0.49 € | 2.20 rps |
44. GPT-3.5-instruct 0914 ☁️ | 47 | 92 | 69 | 62 | 88 | 33 | 65 | 0.35 € | 2.15 rps |
45. GPT-3.5 v1/0301 ☁️ | 55 | 82 | 69 | 78 | 82 | 26 | 65 | 0.35 € | 4.12 rps |
46. Llama 3 8B OpenChat-3.6 20240522 f16 ✅ | 76 | 51 | 76 | 60 | 88 | 38 | 65 | 0.28 € | 3.79 rps |
47. Mistral Nemo 12B v1/2407 ☁️ | 54 | 58 | 51 | 100 | 75 | 49 | 64 | 0.03 € | 1.22 rps |
48. Meta Llama 3.2 11B Vision🦙 | 70 | 71 | 65 | 70 | 71 | 36 | 64 | 0.04 € | 1.49 rps |
49. Starling 7B-alpha f16 ⚠️ | 58 | 66 | 67 | 70 | 88 | 34 | 64 | 0.58 € | 1.85 rps |
50. Qwen 2.5 7B Instruct ⚠️ | 48 | 77 | 80 | 60 | 69 | 47 | 63 | 0.07 € | 1.25 rps |
51. Llama 3 8B Hermes 2 Theta🦙 | 61 | 73 | 74 | 70 | 85 | 16 | 63 | 0.05 € | 0.55 rps |
52. Yi 1.5 34B Chat f16 ⚠️ | 47 | 78 | 70 | 70 | 86 | 26 | 63 | 1.18 € | 1.37 rps |
53. Claude 3 Haiku ☁️ | 64 | 69 | 64 | 70 | 75 | 35 | 63 | 0.08 € | 0.52 rps |
54. Liquid: LFM 40B MoE ⚠️ | 72 | 69 | 65 | 60 | 82 | 24 | 62 | 0.00 € | 1.45 rps |
55. Meta Llama 3.1 8B Instruct f16🦙 | 57 | 74 | 62 | 70 | 74 | 32 | 61 | 0.45 € | 2.41 rps |
56. Qwen2 7B Instruct f32 ⚠️ | 50 | 81 | 81 | 60 | 66 | 31 | 61 | 0.46 € | 2.36 rps |
57. Mistral Small v3/2409 ☁️ | 43 | 75 | 71 | 75 | 75 | 26 | 61 | 0.06 € | 0.81 rps |
58. Claude 3 Sonnet ☁️ | 72 | 41 | 74 | 70 | 78 | 28 | 61 | 0.95 € | 0.85 rps |
59. Mixtral 8x22B API (Instruct) ☁️ | 53 | 62 | 62 | 100 | 75 | 7 | 60 | 0.17 € | 3.12 rps |
60. Mistral Pixtral 12B ✅ | 53 | 69 | 73 | 60 | 64 | 40 | 60 | 0.03 € | 0.83 rps |
61. Codestral Mamba 7B v1 ✅ | 53 | 66 | 51 | 100 | 71 | 17 | 60 | 0.30 € | 2.82 rps |
62. Inflection 3 Productivity ⚠️ | 46 | 59 | 39 | 70 | 79 | 61 | 59 | 0.92 € | 0.17 rps |
63. Anthropic Claude Instant v1.2 ☁️ | 58 | 75 | 65 | 75 | 65 | 16 | 59 | 2.10 € | 1.49 rps |
64. Cohere Command R ☁️ | 45 | 66 | 57 | 70 | 84 | 27 | 58 | 0.13 € | 2.50 rps |
65. Anthropic Claude v2.0 ☁️ | 63 | 52 | 55 | 60 | 84 | 34 | 58 | 2.19 € | 0.40 rps |
66. Qwen1.5 7B Chat f16 ⚠️ | 56 | 81 | 60 | 50 | 60 | 36 | 57 | 0.29 € | 3.76 rps |
67. Mistral Large v1/2402 ☁️ | 37 | 49 | 70 | 78 | 84 | 25 | 57 | 0.58 € | 2.11 rps |
68. Microsoft WizardLM 2 8x22B ⚠️ | 48 | 76 | 79 | 50 | 62 | 22 | 56 | 0.13 € | 0.70 rps |
69. Qwen1.5 14B Chat f16 ⚠️ | 50 | 58 | 51 | 70 | 84 | 22 | 56 | 0.36 € | 3.03 rps |
70. MistralAI Ministral 8B ✅ | 56 | 55 | 41 | 85 | 68 | 30 | 56 | 0.02 € | 1.02 rps |
71. MistralAI Ministral 3B ✅ | 50 | 48 | 39 | 92 | 60 | 41 | 55 | 0.01 € | 1.02 rps |
72. Anthropic Claude v2.1 ☁️ | 29 | 58 | 59 | 78 | 75 | 32 | 55 | 2.25 € | 0.35 rps |
73. Llama2 13B Vicuna-1.5 f16🦙 | 50 | 37 | 55 | 60 | 82 | 37 | 53 | 0.99 € | 1.09 rps |
74. Mistral 7B Instruct v0.1 f16 ☁️ | 34 | 71 | 69 | 59 | 62 | 23 | 53 | 0.75 € | 1.43 rps |
75. Mistral 7B OpenOrca f16 ☁️ | 54 | 57 | 76 | 25 | 78 | 27 | 53 | 0.41 € | 2.65 rps |
76. Meta Llama 3.2 3B🦙 | 52 | 71 | 66 | 70 | 44 | 14 | 53 | 0.01 € | 1.25 rps |
77. Google Recurrent Gemma 9B IT f16 ⚠️ | 58 | 27 | 71 | 60 | 56 | 23 | 49 | 0.89 € | 1.21 rps |
78. Codestral 22B v1 ✅ | 38 | 47 | 44 | 78 | 66 | 13 | 48 | 0.06 € | 4.03 rps |
79. Llama2 13B Hermes f16🦙 | 50 | 24 | 37 | 74 | 60 | 42 | 48 | 1.00 € | 1.07 rps |
80. IBM Granite 34B Code Instruct f16 ☁️ | 63 | 49 | 34 | 70 | 57 | 7 | 47 | 1.07 € | 1.51 rps |
81. Mistral Small v2/2402 ☁️ | 33 | 42 | 45 | 92 | 56 | 8 | 46 | 0.06 € | 3.21 rps |
82. DBRX 132B Instruct ⚠️ | 43 | 39 | 43 | 77 | 59 | 10 | 45 | 0.26 € | 1.31 rps |
83. NVIDIA Llama 3.1 Nemotron 70B Instruct🦙 | 68 | 54 | 25 | 74 | 28 | 21 | 45 | 0.09 € | 0.53 rps |
84. Mistral Medium v1/2312 ☁️ | 41 | 43 | 44 | 61 | 62 | 12 | 44 | 0.81 € | 0.35 rps |
85. Meta Llama 3.2 1B🦙 | 32 | 40 | 33 | 40 | 68 | 51 | 44 | 0.02 € | 1.69 rps |
86. Llama2 13B Puffin f16🦙 | 37 | 15 | 44 | 70 | 56 | 39 | 43 | 4.70 € | 0.23 rps |
87. Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 67 | 63 | 52 | 56 | 8 | 43 | 0.06 € | 2.21 rps |
88. Microsoft WizardLM 2 7B ⚠️ | 53 | 34 | 42 | 59 | 53 | 13 | 42 | 0.02 € | 0.89 rps |
89. Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 22 | 47 | 59 | 38 | 62 | 8 | 39 | 0.05 € | 2.39 rps |
90. Gemma 2 9B IT ⚠️ | 45 | 25 | 47 | 34 | 68 | 13 | 38 | 0.02 € | 0.88 rps |
91. Meta Llama2 13B chat f16🦙 | 22 | 38 | 17 | 60 | 75 | 6 | 36 | 0.75 € | 1.44 rps |
92. Mistral 7B Zephyr-β f16 ✅ | 37 | 34 | 46 | 59 | 29 | 4 | 35 | 0.46 € | 2.34 rps |
93. Meta Llama2 7B chat f16🦙 | 22 | 33 | 20 | 60 | 50 | 18 | 34 | 0.56 € | 1.93 rps |
94. Mistral 7B Notus-v1 f16 ⚠️ | 10 | 54 | 25 | 52 | 48 | 4 | 32 | 0.75 € | 1.43 rps |
95. Orca 2 13B f16 ⚠️ | 18 | 22 | 32 | 22 | 67 | 20 | 30 | 0.95 € | 1.14 rps |
96. Mistral 7B v0.1 f16 ☁️ | 0 | 9 | 48 | 53 | 52 | 12 | 29 | 0.87 € | 1.23 rps |
97. Mistral 7B Instruct v0.2 f16 ☁️ | 11 | 30 | 54 | 12 | 58 | 8 | 29 | 0.96 € | 1.12 rps |
98. Google Gemma 2B IT f16 ⚠️ | 33 | 28 | 16 | 57 | 15 | 20 | 28 | 0.30 € | 3.54 rps |
99. Microsoft Phi 3 Medium 4K Instruct 14B f16 ⚠️ | 5 | 34 | 30 | 11 | 47 | 8 | 22 | 0.82 € | 1.32 rps |
100. Orca 2 7B f16 ⚠️ | 22 | 0 | 26 | 20 | 52 | 4 | 21 | 0.78 € | 1.38 rps |
101. Google Gemma 7B IT f16 ⚠️ | 0 | 0 | 0 | 9 | 62 | 0 | 12 | 0.99 € | 1.08 rps |
102. Meta Llama2 7B f16🦙 | 0 | 5 | 22 | 3 | 28 | 2 | 10 | 0.95 € | 1.13 rps |
103. Yi 1.5 9B Chat f16 ⚠️ | 0 | 4 | 29 | 9 | 0 | 8 | 8 | 1.41 € | 0.76 rps |
Grok 2 Beta from X-AI
This wasn’t expected, but the second version of Grok from X-AI suddenly started making sense (the previous one was nearly useless). Grok 2 Beta made its way into the TOP 15. The model performed quite well overall on tasks extracted from LLM products in our benchmark, and even its Reason score is respectable.
The model scores close to older versions of GPT-4, but it is still behind Qwen 2.5 72B Instruct, which you can download and run on your own hardware. Nonetheless, the news is great: pretty much any company can make it into the TOP 20 of our benchmark if it has enough diverse data and access to compute capacity for training.
Gemini 1.5 Flash 8B - Future Perfect
In the LLM Benchmark for September we talked about the new models in the Llama 3.2 series, which really pushed the state of the art for local models. The progress doesn’t stop there.
Google has released a new Gemini 1.5 Flash 8B model, and it shows nice results on our product benchmark: this 8B model performs at the level of GPT-3.5 or Llama 3 70B, almost catching up with the regular Gemini 1.5 Flash.
The model also illustrates Google’s progress in LLM development: it is the cheapest model in our table, yet it ranks quite high, even compared to the Gemini Pro LLMs released in previous months.
The biggest limitation of this model: it is closed. Even though we know its size (8B), it isn’t possible to download the weights and run it locally.
However, Gemini 1.5 Flash 8B can be used quite cheaply. Plus, as history tells us, whatever one company has achieved, another company can soon repeat. So we’ll be waiting for more small models of this quality, preferably ones that can run locally.
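For reference, here is a minimal sketch of calling the model through the google-generativeai Python SDK; `gemini-1.5-flash-8b` is the model id Google lists, and the prompt is just a placeholder.

```python
# Minimal sketch: calling Gemini 1.5 Flash 8B via the google-generativeai SDK.
# Assumes `pip install google-generativeai` and an API key in GEMINI_API_KEY.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-8b")

response = model.generate_content(
    "Summarize this support ticket in one sentence: ..."  # placeholder prompt
)
print(response.text)
```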
Claude Sonnet 3.5 and Haiku 3.5 - Getting better
Anthropic released updates to the two models in its lineup:
Medium: Sonnet 3.5
Small: Haiku 3.5
Sonnet 3.5 is currently the highest scoring model from Anthropic in our benchmark. It jumped to 11th place.
Compared to the previous version of Sonnet 3.5, this version shows improved instruction-following and enhanced coding capabilities, both in writing code and in handling more complex engineering tasks.
Claude 3.5 Sonnet v2 is overall a decent model, but you can get better quality at a lower price, for example by using GPT-4o or running a local Qwen 2.5.
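To make the local alternative concrete, here is a minimal sketch of running Qwen 2.5 72B Instruct with Hugging Face transformers. `Qwen/Qwen2.5-72B-Instruct` is the public checkpoint id; at f16 the 72B model needs several large GPUs, so treat this as a sketch rather than a sizing guide.

```python
# Minimal sketch: running Qwen 2.5 72B Instruct locally with transformers.
# Assumes `pip install transformers accelerate torch` and enough GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"  # public Hugging Face checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # shard across available GPUs
)

messages = [{"role": "user", "content": "Draft a short product description for ..."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))
```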
Claude 3.5 Haiku is another improvement in the Haiku series. The model has improved scores across the board (except in the Code+Engineering category). The biggest jump was in Reason: from 35 to 68! This is the highest Reason score of all Anthropic models. Could this point towards a new architecture in the next Claude series?
Additional facts support this theory: the Haiku model was the last one to come out, and it costs 4x more than the previous Haiku version. Cost structure changes in LLMs are usually aligned with underlying architectural changes.
Because of the price hike, Haiku is no longer in the “smart and extremely cheap” category. At this price point you can find better models, like GPT-4o mini or Google Gemini 1.5 Flash 8B.
The overall trend of quality improvements within model ranges continues. Let’s see if the improved Reason score shows up in other model releases from Anthropic.
Trends
Speaking of trends, take a look at this interesting meta-trend: within the last two months, OpenAI, Google and Anthropic have introduced new lower-tier models into higher pricing tiers. This makes the charts look as if LLM performance is actually degrading within these tiers.
This could potentially mean a combination of three things:
LLM providers are starting to optimise their price offerings based on cost and usage.
It is no longer possible to compete on quality without increasing compute resources (could we be hitting the limits of the transformer architecture?).
Our price brackets for categories were not chosen well. We’ll need to redo the entire chart.
And if we plot Gemini 1.5 Flash 8B on the map of locally-capable models, the picture looks like the one below, marking a nice performance jump in the state of the art.
Let’s see how things continue into November 2024. We will keep you updated!
Transform Your Digital Projects with the Best AI Language Models!
Discover the transformative power of the best LLMs and revolutionize your digital products with AI! Stay future-focused, boost efficiency, and gain a clear competitive edge. We help you elevate your business value to the next level.
Martin Warnung