
AI support for MtxVec

It is possible to use AI LLM chats to convert existing code to an MtxVec-accelerated variant with very few errors and excellent performance. How?

We provide a condensed summary, distilled from over 250,000 lines of code, which our customers can use to create context when working with AI chats. This initial message passed to the AI uses only about 30,000 tokens. Once the chat knows about MtxVec, it can automatically modify your own code so that it uses MtxVec and consequently runs faster. Automatic block processing and multi-threading are also supported: simply ask the chat to create vectorized and block-enabled code, or ask for a multi-threaded variant with MtxVec right away. Of course there will be errors in the AI output too, but it can save you a ton of typing.

The AI is only useful if you already know exactly what you want; it saves you the time of looking up declarations and function names and of typing the code. Relying on AI responses when you don't know what the result should be is a bad idea!

Start with simple examples and build your way up. We have also updated the MtxVec library API to match some of the ideas that the LLMs wanted to use in their outputs, so it matters that you use the latest MtxVec version for best results.

We had good experience with the ChatGPT models o4 and o3-mini-high. Grok 3 has the fastest output and modest error rates. At the time of writing (March 2025), Grok delivers the best results. Use AI chat models with a context length of 128K tokens or more. DeepSeek returned "server busy" most of the time. The capabilities and features of AI providers change quickly, sometimes for the better and sometimes for the worse. It makes sense to use more than one subscription and not to commit for more than a month at a time.

Getting started

As the first message, paste the content of

MtxVecPriorKnowledge.txt 

Most chat engines will require a paid subscription to parse this content. Don't pass an http link; copy and paste the file content as a message. You may need to break this one long message into smaller pieces for the AI engine to be willing to process it. Content fetched from http links can be truncated considerably by the chat engine.

ChatGPT allows a custom configuration:

Dew Code Companion

However, the results in our tests were still poor (June 2025). ChatGPT Codex is not excluded from this assessment.

The best results are currently delivered by grok.com, which has increased its advantage over the competition and is currently leading the productivity pack with a perceived advantage of 5:1. One of the things we noticed with ChatGPT, but not with Grok, was that some variable names were not right. It worked 9 times out of 10: sometimes a sign was lost, sometimes something else. This almost never happened with Grok. ChatGPT also developed a big appetite for simply making things up (June 2025): using classes which don't exist, functions that don't exist, and so on.

Everything points to an inability to provide the needed computing power (with growing demand) and to the use of more and more ad-hoc scripts to extract information and produce rule-based answers rather than "true knowledge".

Supported topics

As the second message paste your own source code and ask for the following operations:

  • Vectorize the code
  • Vectorize and apply block processing
  • Apply multi-threading with block processing
  • When handling if-then clauses, ask for a "masked" or a "patch-up" implementation. Depending on the code, one can be (much) more efficient than the other (see the scalar sketch after this list).
  • Optimize temporary variables.
  • Use TVec/TVecInt/TMtx/TMtxInt (or Vector/VectorInt/Matrix/MatrixInt, depending on your preference).
  • Write a benchmark and a test comparing the scalar and the vectorized variant.
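
As an illustration of the if-then point above, here is a hypothetical scalar loop (the names x and y are placeholders) that you could paste together with a request for both a masked and a patch-up MtxVec variant, and then benchmark the two against each other:

// Hypothetical scalar loop with a branch; ask the chat for a "masked"
// and for a "patch-up" MtxVec variant and compare their timings.
for i := 0 to Length(x) - 1 do
begin
   if x[i] > 0 then
      y[i] := Sqrt(x[i])   // main path
   else
      y[i] := 0;           // exceptional path handled by the branch
end;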

For your own protection

  • Never assume that the AI knows something so that you don't need to know it yourself. It can make errors and write complete nonsense in a split second.
  • Check whether any of the calling objects is also passed as a parameter to the method. (This won't work; see the snippet after this list.)
  • The generated code may have very obvious additional optimization options. The more you know about which methods exist, the easier it is to spot them. Point out specific methods if they could be used.
  • Check for redundant Size calls. Most methods will also size the calling object internally.
  • Before starting to vectorize and multi-thread the code, implement comprehensive tests of your own code and a benchmark. This will allow you to track progress.
  • If you see some strange new function being called, ask the chat to check whether it actually exists.
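
To make the second point concrete, the pattern to watch for looks like this (the names are reused from Example 1 below; this is an illustration, not generated output):

// Problematic: the calling object 'a' also appears in the parameter list.
// MtxVec methods size and overwrite the calling object, so this will not run.
a.AddScaled(a, xScale, y, yScale, z);

// Fixed: the destination 'a' is distinct from all of the source parameters.
a.AddScaled(x, xScale, y, yScale, z);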

We have prepared two examples. The first, simpler example uses ChatGPT; the second runs on Grok.

Example 1

Ask a question like this (using ChatGPT o3-mini or o3-mini-high):

"Given:

procedure AddMy(const a,b,c: TDoubleArray);
var i: integer;
begin
   for i := 0 to Length(a)-1 do
   begin
      c[i] := a[i] + 2*b[i]
   end;
end;

Please write a variant of AddMy that copies the data from the dynamic arrays and performs the operation using the MtxVec vector type."

We get an answer that looks like this:

ChatGPTResponse.png

This code actually compiles and works correctly. Now you can try this:

"Please vectorize:

xRMS := 0;
for i := 0 to Length(x)-1 do
begin
    xRMS := xRMS + sqr(x[i]);
end;
xRMS := sqrt(xRMS/Length(x));

for i := 0 to Length(x)-1 do
begin
    a[i] := x[i]*xScale + y[i]*yScale + z[i] + cos(x[i]);
end;

"

The answer we get:

// Compute RMS of x:
xRMS := Sqrt(x.SumOfSquares / x.Length);

// Compute a = X*xScale + Y*yScale + Z + cos(X)
a := AddScaled(x, xScale, y, yScale, z) + cos(x);
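
For comparison, a purely operator-based formulation of that second line (the operator syntax for scalar multiplication on the vector type is assumed here) would allocate a temporary vector for every binary operation:

// Operator-only form: each *, + and cos() below produces its own temporary vector.
a := x*xScale + y*yScale + z + cos(x);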

Note that the AI correctly found the AddScaled pattern, which reduces the number of temporary objects created during evaluation. The next command would be:

"Looks good, now also apply Block processing please"

Which will return this:

var sumSquares: double;

begin
// --- Compute RMS of x using block processing ---
sumSquares := 0;
x.BlockInit; // Initialize block processing on x
while not x.BlockEnd do
begin
   // For the current block, add its sum of squares
   sumSquares := sumSquares + x.SumOfSquares;
   x.BlockNext; // Advance to the next block
end;
// After processing all blocks, compute the RMS:
xRMS := Sqrt(sumSquares / x.Length);

// --- Compute a = x*xScale + y*yScale + z + cos(x) using block processing ---
// Ensure 'a' is sized to match 'x'
a.Size(x);

// Initialize block processing on all vectors
x.BlockInit;
y.BlockInit;
z.BlockInit;
a.BlockInit;
while not x.BlockEnd do
begin
   // For the current block, compute: a = x*xScale + y*yScale + z
  a.AddScaled(x, xScale, y, yScale, z);
  // Then add cos(x) element-wise to a
  a.Add(cos(x));

  // Advance all vectors to the next block
  x.BlockNext;
  y.BlockNext;
  z.BlockNext;
  a.BlockNext;
end;
end;

The next command would be:

"Block processing only makes sense when you can reuse a block multiple times in the same while loop. Try joining the two loops."

which gives:

var
sumSquares: double;
begin
// Ensure 'a' is sized to match 'x'
a.Size(x);
sumSquares := 0;

// Initialize block processing on all vectors
x.BlockInit;
y.BlockInit;
z.BlockInit;
a.BlockInit;

while not x.BlockEnd do
begin
     // Accumulate sum of squares for the current block
     sumSquares := sumSquares + x.SumOfSquares;

     // For the current block, compute:
     // a = x * xScale + y * yScale + z + cos(x)
     // Using compound operations for efficiency:
     a.AddScaled(x, xScale, y, yScale, z);
     a.Add(cos(x));

     // Advance to the next block
     x.BlockNext;
     y.BlockNext;
     z.BlockNext;
     a.BlockNext;
end;

// Compute the overall RMS from the accumulated sum of squares
xRMS := Sqrt(sumSquares / x.Length);
end;

This is in fact the correct and optimal answer. And the performance improvement:

  • Original, not threaded: 7.10 ms
  • Block vectorized and threaded: 1.27 ms
  • Block vectorized with methods and threaded: 1.13 ms

These timings were measured on an 8-core CPU.

Example 2

We start on grok.com by first pasting the prior-knowledge text and then continue with:
InverseGaussianPDF.png

After asking for vectorization, we finish with:

InverseGaussianPDFVectorized.png

Here is what we had to do after the initial request for vectorization:

  • Ask it to use TVec/TMtx instead of Vector/Matrix for local variables. This also allows for the possibility of multi-threading. The use of CreateIt/FreeIt relies on separate memory pools per thread and allows linear scaling with thread count. CreateIt/FreeIt is also roughly 10x faster than calling TVec.Create/TVec.Free (a minimal sketch follows this list).
  • Request to remove the use of masks, because the result will be correct with NAN and INF, even without special handling.
  • Multiple messages were needed to explore joining individual TVec methods into more compound expressions. Often the chat will report that it overlooked an overload and apologize. We ended up using InvSqrt, MulAndDiv and the three-parameter Mul.
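
The CreateIt/FreeIt pattern mentioned in the first bullet looks, in sketch form, like this (the single-object overload is shown; production code will typically create several objects in one call):

var tmp: TVec;
begin
   CreateIt(tmp);        // takes a TVec from the per-thread memory pool
   try
      // ... vectorized work with tmp ...
   finally
      FreeIt(tmp);       // returns the object to the pool
   end;
end;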

And here is how the initial response looked:

InverseGaussianPDFInit.png

Note multiple cases of passing the calling object as a parameter instead of using an in-place variant of the method. (This will not run.) The patch-up loop can be very performant and easy to use and understand, but sometimes the math will already produce the special numbers NAN and INF on its own. All other changes are related to the use of more compound vectorized expressions. And the benchmark?

  • For 32-bit we get a 16x improvement on one CPU core.
  • For 64-bit we get an 11x improvement on one CPU core.

And we retain the possibility of linear scaling with CPU core count when multi-threading. Will this be faster than the output of the best vectorizing C++ compiler? Absolutely. It will be 2-3x faster than C++ unless serious work by someone with considerable experience goes into manually optimizing the vectorized function loop, and even then it is not certain that the results can be matched.

Benchmarking considerations

  • Be sure to include an error check verifying that the results of the scalar and the vectorized version are equal (a minimal harness sketch follows this list).
  • Compare performance on the "valid function range", because typically you will not be computing on invalid data.
  • Limit the length of the vectors to 1000. This reflects the block length used by block processing: with block processing you can process vectors of any length at performance levels as if their size were only 1000. Ask the chat to apply "block processing" to see what it does.
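
A minimal benchmark harness along these lines could look as follows. AddMy is the scalar routine from Example 1; AddMyVec stands for the hypothetical vectorized variant returned by the chat, and TDoubleArray is the same dynamic-array type used there. Only standard Delphi units are used:

uses
   System.Diagnostics, System.Math;

procedure BenchmarkAddMy;
const
   N = 1000;             // reflects the block length used by block processing
   Runs = 10000;
var
   a, b, c1, c2: TDoubleArray;
   i, r: integer;
   sw: TStopwatch;
   maxErr: double;
begin
   SetLength(a, N); SetLength(b, N);
   SetLength(c1, N); SetLength(c2, N);
   for i := 0 to N - 1 do
   begin
      a[i] := Random;    // stay on the valid function range
      b[i] := Random;
   end;

   sw := TStopwatch.StartNew;
   for r := 1 to Runs do AddMy(a, b, c1);
   WriteLn('Scalar:     ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for r := 1 to Runs do AddMyVec(a, b, c2);   // hypothetical vectorized variant
   WriteLn('Vectorized: ', sw.ElapsedMilliseconds, ' ms');

   // Error check: the scalar and the vectorized results must agree.
   maxErr := 0;
   for i := 0 to N - 1 do
      maxErr := Max(maxErr, Abs(c1[i] - c2[i]));
   WriteLn('Max difference: ', maxErr);
end;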

Comparison to vectorized Python

Basic guidelines when using one CPU core:

  • When using Intel oneAPI IPP primitives line for line in both languages, MtxVec will in the worst case be only 5x faster than Python.
  • In a typical case, MtxVec will be roughly 300x faster than Python.

And considerations for multi-threading:

  • MtxVec contains a number of mechanisms to achieve linear scaling with core count; 32 CPU cores can give 32x faster code.
  • Python will typically deliver an improvement of about 2x when multi-threading, even when 32 CPU cores are available.

Although a comprehensive overview would be a big job, these are some numbers to start with.

 (Last update, June 11th, 2025)