MSVC Machine-Independent Optimizations in Visual Studio 2022 17.7

Troy Johnson
September 25th, 2023

This blog post presents a selection of machine-independent optimizations that were added between Visual Studio versions 17.4 (released November 8, 2022) and 17.7 P3 (released July 11, 2023). Each optimization below is illustrated with assembly code for both X64 and ARM64 to demonstrate its machine-independent nature.

Optimizing Memory Across Block Boundaries

When a small struct is loaded into a register, we can optimize field accesses to extract the correct bits from the register instead of reading the field back from memory. Historically in MSVC, this optimization has been limited to memory accesses within the same basic block. We are now able to perform the same optimization across block boundaries in many cases.

In the example ASM listings below, the store to the stack and the subsequent reload from the stack are eliminated, resulting in less memory traffic as well as lower stack memory usage.

Example C++ Source Code:

    #include <string_view>

    bool compare(const std::string_view& l, const std::string_view& r) {
        return l == r;
    }

Required Compiler Flags: /O2

X64 ASM:

17.4:

    sub     rsp, 56
    movups  xmm0, XMMWORD PTR [rcx]
    mov     r8, QWORD PTR [rcx+8]
    movaps  XMMWORD PTR $T1[rsp], xmm0
    cmp     r8, QWORD PTR [rdx+8]
    jne     SHORT $LN9@compare
    mov     rdx, QWORD PTR [rdx]
    mov     rcx, QWORD PTR $T1[rsp]
    call    memcmp
    test    eax, eax
    jne     SHORT $LN9@compare
    mov     al, 1
    add     rsp, 56
    ret     0
    $LN9@compare:
    xor     al, al
    add     rsp, 56
    ret     0

17.7:

    sub     rsp, 40
    mov     r8, QWORD PTR [rcx+8]
    movups  xmm1, XMMWORD PTR [rcx]
    cmp     r8, QWORD PTR [rdx+8]
    jne     SHORT $LN9@compare
    mov     rdx, QWORD PTR [rdx]
    movq    rcx, xmm1
    call    memcmp
    test    eax, eax
    jne     SHORT $LN9@compare
    mov     al, 1
    add     rsp, 40
    ret     0
    $LN9@compare:
    xor     al, al
    add     rsp, 40
    ret     0

ARM64 ASM:

17.4:

    str     lr,[sp,#-0x10]!
    sub     sp,sp,#0x20
    ldr     q17,[x1]
    ldr     q16,[x0]
    umov    x8,v17.d[1]
    umov    x2,v16.d[1]
    stp     q17,q16,[sp]
    cmp     x2,x8
    bne     |$LN9@compare|
    ldr     x1,[sp]
    ldr     x0,[sp,#0x10]
    bl      memcmp
    cbnz    w0,|$LN9@compare|
    mov     w0,#1
    add     sp,sp,#0x20
    ldr     lr,[sp],#0x10
    ret
    |$LN9@compare|
    mov     w0,#0
    add     sp,sp,#0x20
    ldr     lr,[sp],#0x10
    ret

17.7:

    str     lr,[sp,#-0x10]!
    ldr     q17,[x1]
    ldr     q16,[x0]
    umov    x8,v17.d[1]
    umov    x2,v16.d[1]
    cmp     x2,x8
    bne     |$LN9@compare|
    fmov    x1,d17
    fmov    x0,d16
    bl      memcmp
    cbnz    w0,|$LN9@compare|
    mov     w0,#1
    ldr     lr,[sp],#0x10
    ret
    |$LN9@compare|
    mov     w0,#0
    ldr     lr,[sp],#0x10
    ret

Vector Logical and Arithmetic Optimizations

We continue to add patterns for recognizing vector operations that are equivalent to intrinsics or short sequences of intrinsics. An example is recognizing common forms of vector absolute difference calculations. A long series of bitwise instructions can be replaced with a specialized absolute value or absolute difference instruction, such as vpabsd on X64 and sabd on ARM64.

Example C++ Source Code:

    void s32_1(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
        for (int i = 0; i < n; i++) {
            a[i] = (b[i] - c[i]) > 0 ? (b[i] - c[i]) : (c[i] - b[i]);
        }
    }

Required Flags: /O2 /arch:AVX for X64, /O2 for ARM64

X64 ASM:

17.4:

    $LL4@s32_1:
    movdqu  xmm0, XMMWORD PTR [r11+rax]
    add     ecx, 4
    movdqu  xmm1, XMMWORD PTR [rax]
    lea     rax, QWORD PTR [rax+16]
    movdqa  xmm3, xmm0
    psubd   xmm3, xmm1
    psubd   xmm1, xmm0
    movdqa  xmm2, xmm3
    pcmpgtd xmm2, xmm4
    movdqa  xmm0, xmm2
    andps   xmm2, xmm3
    andnps  xmm0, xmm1
    orps    xmm0, xmm2
    movdqu  XMMWORD PTR [r10+rax-16], xmm0
    cmp     ecx, edx
    jl      SHORT $LL4@s32_1

17.7:

    $LL4@s32_1:
    vmovdqu xmm1, XMMWORD PTR [r10+rax]
    vpsubd  xmm1, xmm1, XMMWORD PTR [rax]
    vpabsd  xmm2, […]