MSVC ARM64 optimizations in Visual Studio 2022 17.6
In the last couple of months, the Microsoft C++ team has been working on improving MSVC ARM64 backend performance, and we are excited to have a couple of optimizations available in Visual Studio 2022 version 17.6. These optimizations improve code generation for both the scalar ISA and the SIMD ISA (NEON). Let’s review some of the interesting ones in this blog.

Before diving into technical details, we’d encourage you to file feedback at Developer Community if you have found performance issues. The feedback helps us prioritize work items in our backlog. The report titled “optimize neon right shift into cmp” is an example of good feedback. Including a tagged subject title, a detailed description of the issue, and a simple repro simplifies our analysis work and helps us deliver a fix more quickly.

Now, let’s see the optimizations.

Auto-Vectorizer supports more NEON instructions with asymmetric operands

The ARM64 backend already supports some NEON instructions with asymmetrically typed operands, like the Add/Subtract Long operations (SADDL/UADDL/SSUBL/USUBL). These instructions add each vector element in the lower or upper half of the first source SIMD register to the corresponding vector element of the second source SIMD register and write the vector result to the destination SIMD register. The destination vector elements are twice as long as the source vector elements.

Now, we have extended this support to Multiply-Add Long and Multiply-Subtract Long (SMLAL/UMLAL/SMLSL/UMLSL). For example:

    void smlal(int * __restrict dst, int * __restrict a,
               short * __restrict b, short * __restrict c)
    {
        for (int i = 0; i < 4; i++)
            dst[i] = a[i] + b[i] * c[i];
    }

In Visual Studio 2022 17.5, the code generation was:

    sxtl        v19.4s,v16.4h
    sxtl        v18.4s,v17.4h
    mla         v20.4s,v18.4s,v19.4s

Extra sign extensions are performed on both source operands to match the type of the destination. In 17.6, this has been optimized into a single instruction:

    smlal       v16.4s,v17.4h,v18.4h
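To make the widening semantics concrete, here is a minimal scalar sketch of what a four-lane SMLAL computes, placed next to the loop from the example. The names `smlal_model` and `smlal_self_check` are hypothetical, introduced only for illustration; they are not part of MSVC or of any NEON API. The fixed-width types stand in for the post's `int` and `short`, assuming a 32-bit `int` and a 16-bit `short`.

```cpp
#include <cassert>
#include <cstdint>

// The loop from the post, with fixed-width types standing in for int/short.
void smlal(int32_t* __restrict dst, const int32_t* __restrict a,
           const int16_t* __restrict b, const int16_t* __restrict c) {
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i] * c[i];
}

// Hypothetical per-lane model of "smlal Vd.4S, Vn.4H, Vm.4H":
// each 16-bit source lane is sign-extended to 32 bits, the two are
// multiplied, and the product is accumulated into the 32-bit lane.
void smlal_model(int32_t acc[4], const int16_t n[4], const int16_t m[4]) {
    for (int lane = 0; lane < 4; ++lane) {
        int32_t wide_n = n[lane];  // sign extension of the narrow lane
        int32_t wide_m = m[lane];
        acc[lane] += wide_n * wide_m;
    }
}

// Usage example: the source loop and the instruction model agree.
void smlal_self_check() {
    const int32_t a[4] = {1, -2, 3, 100000};
    const int16_t b[4] = {2, -3, 32767, -32768};
    const int16_t c[4] = {3, 4, 2, 2};

    int32_t dst[4] = {0, 0, 0, 0};
    smlal(dst, a, b, c);

    int32_t acc[4] = {a[0], a[1], a[2], a[3]};  // accumulator starts as a[]
    smlal_model(acc, b, c);

    for (int i = 0; i < 4; ++i)
        assert(acc[i] == dst[i]);
}
```

Because the sign extension is folded into the instruction's definition, the two separate `sxtl` steps in the 17.5 output become redundant once the vectorizer recognizes this shape.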
The ARM64 ISA supports another variant of these operations, called Add/Subtract Wide. For these, the asymmetry is between the two source operands, not between the sources and the destination. For example:

    void saddw(int *__restrict dst, int *__restrict a, short *__restrict b)
    {
        for (int i = 0; i < 4; i++)
            dst[i] = a[i] + b[i];
    }

In Visual Studio 2022 17.5, the code generation was:

    sxtl        v17.4s,v16.4h
    add         v18.4s,v17.4s,v18.4s

The narrow source gets an extra sign extension to match the wide source. In the 17.6 release, this has been optimized into a single instruction:

    saddw       v16.4s,v16.4s,v17.4h

The same applies to UADDW/SSUBW/USUBW.

Auto-vectorizer now supports small types on ABS/MIN/MAX

ABS/MIN/MAX have slightly complex semantics. Normally, the compiler middle-end or back-end has a pattern matcher that recognizes IR sequences with if-then-else semantics and checks whether they can be converted into ABS/MIN/MAX. There is an issue when the operands have small types (int8 or int16), though. As specified by the C++ standard, small types are promoted to int, which is 32-bit on ARM64. This suits scalar operations, because they can only operate at scalar register width, and on ARM64 the smallest such width is 32 bits, using the sub-register. However, this is not true for the SIMD ISA, whose minimum operation width is the width of a vector lane (element). For example, ARM64 NEON supports operating on int8 and int16 for a couple of operations, including ABS/MIN/MAX. So, to generate SIMD instructions operating on small element sizes and deliver higher computing throughput, the auto-vectorizer needs to do […]
