I don't have time to dig into it today, but what's the error for downsampling to fp8_e4m3 and doing the multiplications in it? That's the real comparison: there are already large errors versus staying at high precision. And I just don't see the errors mattering much in inference (in training, definitely).
As for real comparisons, I think the closest ones are those earlier algorithms that do essentially the same thing. The L-Mul computation is 1+x+y+(0.125 or 0.0625), and ApproxLM at the minimal level-1 precision does 1+x+y+(y), where y must be the smaller value. So in its simplest form it's almost the same.
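For the curious, here's a minimal sketch of the L-Mul idea in Python (my own toy version, not the paper's code; it assumes positive non-zero inputs and uses the 2^-4 = 0.0625 offset):

```python
import math

def lmul(a: float, b: float, offset: float = 0.0625) -> float:
    """Approximate a*b via the L-Mul trick: (1+x)(1+y) ~= 1+x+y+offset.

    x and y are the fractional mantissas of a and b; the xy cross term
    is replaced by a constant offset (0.125 or 0.0625).
    Sign/zero handling omitted for brevity -- assumes a, b > 0.
    """
    ma, ea = math.frexp(a)      # a = ma * 2**ea, with ma in [0.5, 1)
    mb, eb = math.frexp(b)
    x = 2 * ma - 1              # mantissa fraction in [0, 1)
    y = 2 * mb - 1
    mant = 1 + x + y + offset   # the whole "multiplication"
    exp = (ea - 1) + (eb - 1)   # exponents just add
    if mant >= 2:               # carry into the exponent, as in real FP
        mant /= 2
        exp += 1
    return mant * 2 ** exp
```

For example, `lmul(3.0, 5.0)` comes out at 14.5 instead of 15, a bit over 3% off, which is in the same ballpark as the error ratios below.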
I guess the worst-case scenario would be something like truncating (instead of rounding) 1.99999... down to the 1.875 that 3-bit mantissa precision allows. That would be a 6.25% error, and roughly double that if two such values are multiplied.
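Quick sanity check of that bound (a throwaway helper, not from any library):

```python
import math

def trunc_mantissa(v: float, bits: int = 3) -> float:
    """Truncate (not round) a value in [1, 2) to `bits` mantissa bits."""
    step = 2.0 ** -bits                # 3 bits -> steps of 0.125
    return math.floor(v / step) * step

worst = 1.9999999
t = trunc_mantissa(worst)              # largest representable: 1.875
err = (worst - t) / worst              # ~6.25% for one truncated value
pair_err = 1 - (1 - err) ** 2          # ~12.1% when two such values multiply
print(f"{t} -> {err:.4f} single, {pair_err:.4f} pair")
```

So "about double" is right: the two relative errors compound roughly additively for small errors.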
I had Claude write a test program for me ;)

Analysis of FP8_e4m3 multiplication errors across 1,000,000 tests:
Mean error ratio: 1.0371
Median error ratio: 1.0316
Max error ratio: 1.1492

This program only tested numbers that wouldn't overflow FP8_e4m3, of course.
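This isn't the actual program, but a minimal sketch of the kind of comparison I mean: quantize the inputs to e4m3 (nearest representable value, ignoring subnormals and inf/nan), multiply exactly, and compare against the full-precision product:

```python
import math
import random

def to_e4m3(v: float) -> float:
    """Round v to the nearest FP8 e4m3 value (no inf/nan, no subnormals)."""
    if v == 0:
        return 0.0
    s = -1.0 if v < 0 else 1.0
    v = abs(v)
    e = math.floor(math.log2(v))
    e = max(min(e, 8), -6)             # clamp to e4m3's normal exponent range
    step = 2.0 ** (e - 3)              # 3 mantissa bits -> spacing 2^(e-3)
    q = round(v / step) * step
    return s * min(q, 448.0)           # saturate at e4m3's max normal, 448

random.seed(0)
ratios = []
for _ in range(100_000):
    a, b = random.uniform(1, 8), random.uniform(1, 8)   # no overflow possible
    approx = to_e4m3(a) * to_e4m3(b)
    exact = a * b
    ratios.append(max(approx, exact) / min(approx, exact))
print(f"mean error ratio: {sum(ratios) / len(ratios):.4f}")
```

On my machine this lands in the same few-percent range as the numbers above, which is the point: the quantization alone already costs you most of that error budget before any approximate multiplier enters the picture.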