BLAKE2b performance on Apple Silicon

For work, I needed to store some hashed tokens in a database. I was going to keep it simple and use HMAC-SHA256, but having recently read Jean-Philippe Aumasson’s book “Serious Cryptography”, I remembered that BLAKE2 should be faster:

BLAKE2 was designed with the following ideas in mind:

It should be faster than all previous hash standards

Cool, I thought, let’s consider BLAKE2 then. First, let’s write a simple benchmark to see just how much faster BLAKE2 would be than HMAC-SHA256. Performance is not important for my use case, as the hashing will almost certainly not be a bottleneck, but I was curious. So I write a benchmark:

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"log"
	"testing"

	"golang.org/x/crypto/blake2b"
)

func BenchmarkHashes(b *testing.B) {
	token := []byte("some-api-token")
	secretKey := generateSecretKey()

	b.ResetTimer()

	_ = b.Run("HMACSHA256", func(b *testing.B) {
		for b.Loop() {
			_ = HMACSHA256(token, secretKey)
		}
	})

	_ = b.Run("BLAKE2b", func(b *testing.B) {
		for b.Loop() {
			_ = BLAKE2b(token, secretKey)
		}
	})
}

func generateSecretKey() []byte {
	key := make([]byte, 32)
	_, err := rand.Read(key)
	if err != nil {
		panic(err)
	}
	return key
}

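// HMACSHA256 returns the 32-byte HMAC-SHA256 tag of token under secretKey.
func HMACSHA256(token []byte, secretKey []byte) []byte {
	h := hmac.New(sha256.New, secretKey)
	h.Write(token)
	return h.Sum(nil)
}

// BLAKE2b returns the 32-byte keyed BLAKE2b-256 digest of token.
// blake2b.New256 only fails if the key is longer than 64 bytes.
func BLAKE2b(token []byte, secretKey []byte) []byte {
	hasher, err := blake2b.New256(secretKey)
	if err != nil {
		log.Fatal(err)
	}
	hasher.Write(token)
	return hasher.Sum(nil)
}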

Run it and the results are:

cpu: Apple M1 Max
BenchmarkHashes/HMACSHA256-10         	 3442680	       341.0 ns/op	     512 B/op	       6 allocs/op
BenchmarkHashes/BLAKE2b-10            	 1966382	       584.0 ns/op	     416 B/op	       2 allocs/op

OK… So BLAKE2 is slower than HMAC-SHA256. Yes, we have fewer allocations, which is nice, but each operation takes about 70% longer. My first thought is that it might indeed be faster, but only if the input is way, way longer. So I switch to the following token:

	token := []byte(strings.Repeat("some-api-token", 10000)) // 140,000 bytes; requires importing "strings"

cpu: Apple M1 Max
BenchmarkHashes/HMACSHA256-10         	   19536	     59946 ns/op	     512 B/op	       6 allocs/op
BenchmarkHashes/BLAKE2b-10            	    6452	    190926 ns/op	     416 B/op	       2 allocs/op
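With an input this large, the numbers are easier to compare as throughput. The sketch below converts the ns/op figures from the run above into MB/s; throughputMBps is a helper name I made up, and 1 MB is taken as 1e6 bytes. It works out to roughly 2.3 GB/s for HMAC-SHA256 against roughly 0.7 GB/s for BLAKE2b.

```go
package main

import "fmt"

// throughputMBps (hypothetical helper) converts a ns/op figure into MB/s
// for a benchmark that processes inputBytes per operation.
// 1 MB is taken as 1e6 bytes.
func throughputMBps(inputBytes int, nsPerOp float64) float64 {
	bytesPerNs := float64(inputBytes) / nsPerOp
	return bytesPerNs * 1e9 / 1e6
}

func main() {
	// "some-api-token" is 14 bytes, repeated 10000 times.
	inputBytes := len("some-api-token") * 10000

	// ns/op values copied from the Apple M1 Max run above.
	fmt.Printf("HMAC-SHA256: %.0f MB/s\n", throughputMBps(inputBytes, 59946))
	fmt.Printf("BLAKE2b:     %.0f MB/s\n", throughputMBps(inputBytes, 190926))
	fmt.Printf("slowdown:    %.2fx\n", 190926.0/59946.0)
}
```

The testing package can also report this directly: calling b.SetBytes(int64(len(token))) inside a sub-benchmark adds an MB/s column to its output line.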

Hmmmm… This is even worse: BLAKE2b is now about 3.2 times slower. Now, it could be that we’re dealing with a non-optimal implementation. SHA256 is implemented in Go’s standard library and is very likely well optimized, whereas the BLAKE2 implementation I’m using comes from the golang.org/x/crypto package. Perhaps we can find a better one. The first search result recommends github.com/minio/blake2b-simd, which has been archived since 2018. Not promising, but let’s give it a shot.

import (
	blakeMinio "github.com/minio/blake2b-simd"
)

func BLAKE2bMinio(token []byte, secretKey []byte) []byte {
	hasher := blakeMinio.NewMAC(32, secretKey)
	hasher.Write(token)
	return hasher.Sum(nil)
}

These are the results with the original token.

cpu: Apple M1 Max
BenchmarkHashes/HMACSHA256-10         	 3622322	       316.7 ns/op	     512 B/op	       6 allocs/op
BenchmarkHashes/BLAKE2b-10            	 2881012	       415.6 ns/op	     416 B/op	       2 allocs/op
BenchmarkHashes/BLAKE2b-minio-10       	 2138151	       566.4 ns/op	     480 B/op	       2 allocs/op

So this is even slower… I quickly glance over the codebase of github.com/minio/blake2b-simd and notice all the architecture-specific files, but none of them target ARM. Let’s try on a machine with an AMD64 processor.

cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashes/HMACSHA256-4         	  761613	      1488 ns/op	     512 B/op	       6 allocs/op
BenchmarkHashes/BLAKE2b-4            	 2287569	       566.9 ns/op	     416 B/op	       2 allocs/op
BenchmarkHashes/BLAKE2b-minio-4      	 1541990	       765.4 ns/op	     480 B/op	       2 allocs/op

Right, so it seems that the implementations I’m using are only optimized for AMD64. Or are they? As one final test, I spin up an ARM-based VPS on Hetzner. Note that the benchmark was not able to determine the exact CPU model there, so the cpu line is missing from the output.

BenchmarkHashes/HMACSHA256-2         	 1000000	      1007 ns/op	       512 B/op	       6 allocs/op
BenchmarkHashes/BLAKE2b-2            	 1367085	       879.6 ns/op	       416 B/op	       2 allocs/op
BenchmarkHashes/BLAKE2b-minio-2      	 1123965	      1080 ns/op	       480 B/op	       2 allocs/op

So with this, it seems that the issue only occurs on Apple Silicon processors. In the benchmarks on the other processors, the golang.org/x/crypto BLAKE2b does win, although only narrowly on the Hetzner ARM machine. I’m not entirely sure what causes this, as CPU architectures are not my strong suit. If you know, please let me know by posting a comment.
