Const-me 7 days ago [-]
> which means that not even writing C++ SIMD code will make me have a faster implementation than the one Python is using and I will probably have to write my own assembly code
I believe assembly is almost always the wrong choice in the modern world. It’s just that your SIMD version is not very efficient.
Your original SIMD version completes in 0.065ms on my computer. Here’s an optimized version which completes in 0.038ms i.e. 1.7x faster: https://gist.github.com/Const-me/41b013229b20f920bcee22a856c...
Note I have used 4 sets of accumulators to work around the relatively high latency of the FMA instructions.
However, I’m pretty sure the implementation used by these Python libraries is leveraging multiple CPU cores under the hood.
Here’s another C++ version which does that as well, it completed in 0.0136 ms on my computer i.e. 4.8x faster:
https://gist.github.com/Const-me/c61e836bed08cef2f06783c7b11...
neonsunset 7 days ago [-]
> but unless you opt to implement a processor-specific calculation in C++
Not necessarily true if you use C# (or Swift or Mojo) instead:
static float CosineSimilarity(
    ref float a,
    ref float b,
    nuint length
) {
    var sdot = Vector256<float>.Zero;
    var sa = Vector256<float>.Zero;
    var sb = Vector256<float>.Zero;
    // Assumes length is a multiple of 8 (Vector256<float>.Count)
    for (nuint i = 0; i < length; i += 8) {
        var bufa = Vector256.LoadUnsafe(ref a, i);
        var bufb = Vector256.LoadUnsafe(ref b, i);
        sdot = Vector256.FusedMultiplyAdd(bufa, bufb, sdot);
        sa = Vector256.FusedMultiplyAdd(bufa, bufa, sa);
        sb = Vector256.FusedMultiplyAdd(bufb, bufb, sb);
    }
    var fdot = Vector256.Sum(sdot);
    var fanorm = Vector256.Sum(sa);
    var fbnorm = Vector256.Sum(sb);
    return fdot / (MathF.Sqrt(fanorm) * MathF.Sqrt(fbnorm));
}
Compiles to appropriate codegen quality: https://godbolt.org/z/hh16974Gd, on ARM64 it's correctly unrolled to 128x2
Edit: as sibling comment mentioned, this benefits from unrolling, which would require swapping 256 with 512 and += 8 with += 16 in the snippet above, although depending on your luck Godbolt runs this on a CPU with AVX512, so you don't see the unrolling as it just picks the ZMM registers supported by the hardware instead :)
ashvardanian 7 days ago [-]
The conclusion of the article isn't entirely accurate.
> Why? Because they are calculated using the BLAS library available in the OS, which means that not even writing C++ SIMD code will make me have a faster implementation than the one Python is using and I will probably have to write my own assembly code with compiler-like tricks to go as fast as Python plus C++ libraries.
Switching from C++ SIMD intrinsics to assembly won't yield significant performance improvements for cosine distance. The kernel is typically too small for reordering tricks to have much impact. In fact, you can already outperform both NumPy and SciPy with optimized intrinsics, as I discussed in a previous HN post (https://github.com/ashvardanian/simsimd?tab=readme-ov-file#c...). There's also a promising RFC in SciPy to allow pluggable backends, which could make integrating such kernels easier (https://github.com/scipy/scipy/issues/21645#issuecomment-239...).
For many-to-many distance calculations on low-dimensional vectors, the bottleneck is often in the reciprocal square roots. Using LibC is slow but highly accurate. Intrinsics are faster, but you'll need several Newton-Raphson iterations for precision, and the number of iterations varies between x86 and Arm architectures (https://github.com/ashvardanian/simsimd?tab=readme-ov-file#c...). When dealing with high-dimensional vectors, you start competing with BLAS implementations, though they, too, have limitations on newer hardware or in mixed precision scenarios.
Const-me 7 days ago [-]
> Intrinsics are faster, but you'll need several Newton-Raphson iterations for precision
I wonder, have you tried non-approximated intrinsics like _mm_div_ps( mul, _mm_sqrt_ps( div2 ) ) ?
The reason the standard library is so slow is exception handling and other edge cases. On modern CPUs, normal non-approximated FP division and square root instructions aren’t terribly slow, e.g. on my computer the FP32 square root instruction has 15 cycles latency and 0.5 cycles throughput.
adgjlsfhk1 7 days ago [-]
Yeah, you generally can't approximate sqrt faster than computing it; sqrt is roughly as fast as division.
mkristiansen 7 days ago [-]
This is really interesting. I have a couple of questions, mainly from the fact that the C++ code is about 2x slower than NumPy.
I had a look at the assembly generated, both in your repo, and from https://godbolt.org/z/76K1eacsG
If you look at the assembly, you are only using 5 of the AVX registers (ymm0 -- ymm4) before creating a dependency on one of the result registers (ymm0 -- ymm2).
I wonder if widening your step size to use more than one 256-bit register might get you the speedup. Something like this (https://godbolt.org/z/GKExaoqcf) gets more of the AVX registers in your CPU doing work.
That version ends up using 10 of the registers, allowing for 6 fused multiplies, rather than 3, before creating a dependency on a previous result -- you might be able to make the chain even longer.
Again -- This was a really interesting writeup :)
worstspotgain 6 days ago [-]
Interesting read. However, its ultimate conclusion is just the long way of saying that if you use ultra-optimized libraries, and you call them at a rate much lower than their inner loop rate, then it doesn't matter which language you wrap them with.
This is the standard counter to "interpreted languages are slow" and is as old as interpreting itself.
rrhjm53270 6 days ago [-]
Thank you for sharing such interesting work.
A little comment: it would be interesting to add some more aggressive optimization options to the SIMD C++ code to see the performance difference.
On my side with an AMD Ryzen 9 7900X3D CPU, I have
- 0.0592569 ms for `-O3 -march=native` option, and
- 1.7741e-05 ms for `-funsafe-math-optimizations -Ofast -flto=auto -pipe -march=native`
Const-me 6 days ago [-]
OP’s benchmark doesn’t use the return value from the function being benchmarked. When C++ compilers are asked to optimize such code, they often optimize away the whole function.
Pretty sure your 17.7 nanoseconds result had the whole function optimized away. Workarounds are tricky and compiler-specific.
intalentive 6 days ago [-]
Arch specific similarity functions in SIMD, with bindings in Python and other languages:
https://github.com/ashvardanian/SimSIMD
Isn't it pretty bad for accuracy to accumulate large numbers of floats in this fashion? o.O In the example it's 640,000 numbers. log2(640,000) is ~19.3 but the significand of a float has only 23 bits plus an implicit one.
hansvm 6 days ago [-]
Python's floats are usually doubles by default, so it's mostly fine.
That said, yeah, that implementation isn't ideal. At a minimum, Kahan summation is usually free on large vectors (you're bottlenecked on memory bandwidth anyway), give or take the fact that you need to disable floating point re-ordering to keep the compiler from screwing it up and therefore have to order the operations correctly to make it efficient (see some other top-level comments about data dependencies as an example).
QuadmasterXLII 7 days ago [-]
I’m really surprised by the performance of the plain C++ version. Is automatic vectorization turned off? Frankly this task is so common that I would half expect compilers to have a hard coded special case specifically for fast dot products
Edit: Yeah, when I compile the “plain c++” with clang the main loop is 8 vmovups, 16 vfmadd231ps, and an add cmp jne. OP forgot some flags.
mshockwave 7 days ago [-]
which flags did you use and which compiler version?
QuadmasterXLII 7 days ago [-]
clang 19, -O3 -ffast-math -march=native
mshockwave 7 days ago [-]
can confirm fast math makes the biggest difference
QuadmasterXLII 6 days ago [-]
I feel like I’m kinda being the bad aunt by encouraging -ffast-math. It can definitely break some things (e.g. https://pspdfkit.com/blog/2021/understanding-fast-math/ ) but I use it habitually and I’m fine, so clearly it’s safe.
magicalhippo 6 days ago [-]
> It can definitely break some things
I recall it totally fudged up the ray-axis aligned bounding box intersection routine in the raytracer I worked on. The routine relied on infinities being handled correctly, and -ffast-math broke that.
I see the linked article goes into that aspect in detail, wish I had it back then.
IIRC we ended up disabling it for just that file, as it did speed up the rest by a fair bit.
hansvm 6 days ago [-]
I would love a fast-math implementation which handled inf correctly, but no language/compiler seems to care.
mkristiansen 7 days ago [-]
Fast Math basically means "who cares about standards, just add in whatever order you want" :)