MSVC ARM64 Optimizations in Visual Studio 2022 17.7
Hongyon Suauthai (ARM)
September 28th, 2023

In Visual Studio 2022 version 17.6 we added a host of new ARM64 optimizations. In this second edition of our blog, we highlight further performance improvements in the MSVC ARM64 compiler backend, discussing key optimizations in Visual Studio 2022 version 17.7 for both the scalar ISA and the SIMD ISA (NEON). We started introducing these performance optimizations in the 17.6 release and landed the rest in the 17.7 release.

By-element operation

ARM64 supports a by-element form of several instructions, such as fmul, fmla, and fmls. This form lets an instruction consume a single element of a SIMD register directly, selected by an index, instead of a full vector operand. In the example below, where we multiply an array by a scalar, MSVC duplicated the vector element v0.s[0] into all lanes of a SIMD register and then multiplied that register with the SIMD operand holding array b. This is inefficient because the dup instruction adds two more cycles of execution latency before the fmul instruction can execute. To better understand this optimization, take the following sample code:

    void test(float * __restrict a, float * __restrict b, float c)
    {
        for (int i = 0; i < 4; i++)
            a[i] = b[i] * c;
    }

Code generated by MSVC 17.6:

    dup   v17.4s,v0.s[0]
    ldr   q16,[x1]
    fmul  v16.4s,v17.4s,v16.4s
    str   q16,[x0]

In Visual Studio 2022 17.7, we eliminated the duplicate instruction; the multiply now reads the SIMD element directly. The code generation is:

    ldr   q16,[x1]
    fmul  v16.4s,v16.4s,v0.s[0]
    str   q16,[x0]

NEON support for shift right and accumulate with an immediate

This instruction right-shifts a SIMD source by an immediate and accumulates the result into the destination SIMD register. As mentioned earlier, we started working on this optimization in the 17.6 release and completed the feature in the 17.7 release. In the 17.5 release, MSVC turned right shifts into left shifts using a negative shift amount.
To implement it that way, the compiler copied the constant -2 into a general-purpose register and duplicated that register into a SIMD register before the left shift. In the 17.6 release we eliminated the duplicate instruction and used a right shift with an immediate value. That was an improvement, but not yet good enough. We further improved the implementation in 17.7 by combining the right shift and the add, using usra for unsigned and ssra for signed operations. The final implementation is much more compact than the previous ones. To better understand how this optimization works, look at the sample code below:

    void test(unsigned long long * __restrict a, unsigned long long * __restrict b)
    {
        for (int i = 0; i < 2; i++)
            a[i] += b[i] >> 2;
    }

Code generated by MSVC 17.5:

    mov   x8,#-2
    ldr   q16,[x1]
    dup   v17.2d,x8
    ldr   q18,[x0]
    ushl  v17.2d,v16.2d,v17.2d
    add   v18.2d,v17.2d,v18.2d
    str   q18,[x0]

Code generated by MSVC 17.6:

    ldr   q16,[x1]
    ushr  v17.2d,v16.2d,#2
    ldr   q16,[x0]
    add   v16.2d,v17.2d,v16.2d
    str   q16,[x0]

Code generated by MSVC 17.7:

    ldr   q16,[x0]
    ldr   q17,[x1]
    usra  v16.2d,v17.2d,#2
    str   q16,[x0]

NEON right shift into cmp

An arithmetic right shift on a signed integer replicates the msb, with the result being either -1 or 0, and […]
