MSVC Machine-Independent Optimizations in Visual Studio 2022 17.7
This blog post presents a selection of machine-independent optimizations that were added between Visual Studio versions 17.4 (released November 8, 2022) and 17.7 Preview 3 (released July 11, 2023). Each optimization below includes assembly listings for both X64 and ARM64 to demonstrate its machine-independent nature.
Optimizing Memory Across Block Boundaries
When a small struct is loaded into a register, we can optimize field accesses to extract the correct bits from the register instead of accessing it through memory. Historically in MSVC, this optimization has been limited to memory accesses within the same basic block. We are now able to perform this same optimization across block boundaries in many cases.
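For instance, on a 64-bit target std::string_view is just a pointer plus a length, 16 bytes in total, so the whole object fits in a single 128-bit register. Below is a minimal sketch of the kind of type this applies to; the Span16 type and equal function are hypothetical and written only to mirror that layout:

#include <cstddef>
#include <cstring>

// 16 bytes: small enough to live entirely in one XMM/Q register, so field
// accesses can be extracted from that register instead of going through memory.
struct Span16 {
    const char* data;
    std::size_t size;
};

bool equal(const Span16& l, const Span16& r) {
    return l.size == r.size && std::memcmp(l.data, r.data, l.size) == 0;
}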
In the example ASM listings below, a store to the stack and a load from the stack are eliminated, resulting in less memory traffic as well as lower stack memory usage.
Example C++ source code:
#include <string_view>
bool compare(const std::string_view& l, const std::string_view& r) {
return l == r;
}
Required compiler flags: /O2
X64 ASM:
17.4:

sub rsp, 56
movups xmm0, XMMWORD PTR [rcx]
mov r8, QWORD PTR [rcx+8]
movaps XMMWORD PTR $T1[rsp], xmm0
cmp r8, QWORD PTR [rdx+8]
jne SHORT $LN9@compare
mov rdx, QWORD PTR [rdx]
mov rcx, QWORD PTR $T1[rsp]
call memcmp
test eax, eax
jne SHORT $LN9@compare
mov al, 1
add rsp, 56
ret 0
$LN9@compare:
xor al, al
add rsp, 56
ret 0

17.7:

sub rsp, 40
mov r8, QWORD PTR [rcx+8]
movups xmm1, XMMWORD PTR [rcx]
cmp r8, QWORD PTR [rdx+8]
jne SHORT $LN9@compare
mov rdx, QWORD PTR [rdx]
movq rcx, xmm1
call memcmp
test eax, eax
jne SHORT $LN9@compare
mov al, 1
add rsp, 40
ret 0
$LN9@compare:
xor al, al
add rsp, 40
ret 0
ARM64 ASM:
17.4:

str lr,[sp,#-0x10]!
sub sp,sp,#0x20
ldr q17,[x1]
ldr q16,[x0]
umov x8,v17.d[1]
umov x2,v16.d[1]
stp q17,q16,[sp]
cmp x2,x8
bne |$LN9@compare|
ldr x1,[sp]
ldr x0,[sp,#0x10]
bl memcmp
cbnz w0,|$LN9@compare|
mov w0,#1
add sp,sp,#0x20
ldr lr,[sp],#0x10
ret
|$LN9@compare|
mov w0,#0
add sp,sp,#0x20
ldr lr,[sp],#0x10
ret

17.7:

str lr,[sp,#-0x10]!
ldr q17,[x1]
ldr q16,[x0]
umov x8,v17.d[1]
umov x2,v16.d[1]
cmp x2,x8
bne |$LN9@compare|
fmov x1,d17
fmov x0,d16
bl memcmp
cbnz w0,|$LN9@compare|
mov w0,#1
ldr lr,[sp],#0x10
ret
|$LN9@compare|
mov w0,#0
ldr lr,[sp],#0x10
ret
Vector Logical and Arithmetic Optimizations
We continue to add patterns for recognizing vector operations that are equivalent to intrinsics or short sequences of intrinsics. An example is recognizing common forms of vector absolute-difference calculations. A long series of subtract, compare, and bitwise instructions can be replaced with specialized instructions such as vpabsd (vector absolute value) on X64 and sabd (vector signed absolute difference) on ARM64.
Example C++ source code:
void s32_1(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = (b[i] - c[i]) > 0 ? (b[i] - c[i]) : (c[i] - b[i]);
    }
}
Required compiler flags: /O2 /arch:AVX for X64, /O2 for ARM64
X64 ASM:
17.4:

$LL4@s32_1:
movdqu xmm0, XMMWORD PTR [r11+rax]
add ecx, 4
movdqu xmm1, XMMWORD PTR [rax]
lea rax, QWORD PTR [rax+16]
movdqa xmm3, xmm0
psubd xmm3, xmm1
psubd xmm1, xmm0
movdqa xmm2, xmm3
pcmpgtd xmm2, xmm4
movdqa xmm0, xmm2
andps xmm2, xmm3
andnps xmm0, xmm1
orps xmm0, xmm2
movdqu XMMWORD PTR [r10+rax-16], xmm0
cmp ecx, edx
jl SHORT $LL4@s32_1

17.7:

$LL4@s32_1:
vmovdqu xmm1, XMMWORD PTR [r10+rax]
vpsubd xmm1, xmm1, XMMWORD PTR [rax]
vpabsd xmm2, xmm1 ; vector abs
add edx, 4
vmovdqu XMMWORD PTR [rbx+rax], xmm2
lea rax, QWORD PTR [rax+16]
cmp edx, ecx
jl SHORT $LL4@s32_1
ARM64 ASM:
17.4:

|$LL24@s32_1|
sbfiz x9,x8,#2,#0x20
ldr q19,[x9,x2]
add w8,w8,#4
ldr q18,[x9,x1]
cmp w8,w10
sub v16.4s,v18.4s,v19.4s
cmgt v17.4s,v16.4s,v21.4s
and v20.16b,v17.16b,v16.16b
sub v16.4s,v19.4s,v18.4s
bic v16.16b,v16.16b,v17.16b
orr v16.16b,v16.16b,v20.16b
str q16,[x9,x0]
blt |$LL24@s32_1|
cmp w8,w3
bge |$LN27@s32_1|

17.7:

|$LL24@s32_1|
sbfiz x9,x8,#2,#0x20
ldr q17,[x9,x1]
add w8,w8,#4
ldr q16,[x9,x2]
cmp w8,w10
sabd v16.4s,v17.4s,v16.4s ; vector abs
str q16,[x9,x0]
blt |$LL24@s32_1|
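For reference, the 17.7 loop body corresponds closely to what one would write by hand with intrinsics. Here is a sketch of a single X64 vector iteration; the function name s32_1_simd is made up for illustration, and _mm_abs_epi32 requires SSSE3, which /arch:AVX implies:

#include <immintrin.h>

// One vector iteration of the absolute-difference loop: processes four ints
// with unaligned loads/stores, computing |b[i] - c[i]| via a subtract + abs.
void s32_1_simd(int* __restrict a, const int* __restrict b, const int* __restrict c) {
    __m128i vb   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c));
    __m128i diff = _mm_sub_epi32(vb, vc);   // b[i] - c[i]
    __m128i absd = _mm_abs_epi32(diff);     // |b[i] - c[i]| (pabsd / vpabsd)
    _mm_storeu_si128(reinterpret_cast<__m128i*>(a), absd);
}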
Vector Remainder Loops
Each iteration of a vectorized loop performs the same work as multiple iterations of the original scalar loop. Depending on the number of original scalar iterations and the vector width, there may be work left over when the vector loop ends. Although this work can be completed by a copy of the original scalar loop, it may contain sufficient work to warrant further vectorization using a smaller vector width. This optimization is now performed on targets that have only a single vector width by partially packing a vector register.
In the example below, three loops are emitted for the original loop. The first processes 64 elements per vector iteration by packing 16 8-bit values into a 128-bit vector register and then unrolling the vector loop by four. The second loop processes 8 elements per vector iteration by packing 8 8-bit values into half of a 128-bit vector register. The third loop processes one element per scalar iteration.
Example C++ source code:
#include <cstdint>

void test(int8_t * __restrict d, int8_t * __restrict a, int8_t * __restrict b, int n) {
    for (int i = 0; i < n; i++) {
        d[i] = a[i] & (~b[i]);
    }
}
Required compiler flags: /O2
X64 ASM:
17.4:

$LL4@test: ; 1st vector loop (64 element iters)
movdqu xmm0, XMMWORD PTR [rcx-16]
add edi, 64 ; 00000040H
movdqu xmm1, XMMWORD PTR [rdx+rcx-16]
movdqu xmm2, XMMWORD PTR [rdx+rcx]
lea rcx, QWORD PTR [rcx+64]
pandn xmm1, xmm3
lea rax, QWORD PTR [r14+rcx]
pand xmm1, xmm0
pandn xmm2, xmm3
movdqu xmm0, XMMWORD PTR [rcx-64]
movdqu XMMWORD PTR [r9+rcx-80], xmm1
pand xmm2, xmm0
movdqu xmm0, XMMWORD PTR [rcx-48]
movdqu xmm1, XMMWORD PTR [rdx+rcx-48]
movdqu XMMWORD PTR [r9+rcx-64], xmm2
pandn xmm1, xmm3
pand xmm1, xmm0
movdqu xmm2, XMMWORD PTR [rdx+rcx-32]
movdqu xmm0, XMMWORD PTR [rcx-32]
pandn xmm2, xmm3
pand xmm2, xmm0
movdqu XMMWORD PTR [r9+rcx-48], xmm1
movdqu XMMWORD PTR [r9+rcx-32], xmm2
cmp rax, rsi
jl SHORT $LL4@test
mov r14, QWORD PTR [rsp]
mov rsi, QWORD PTR [rsp+24]
$LN9@test:
movsxd rcx, edi
mov rdx, rbx
mov rdi, QWORD PTR [rsp+32]
mov rbx, QWORD PTR [rsp+16]
cmp rcx, rdx
jge SHORT $LN3@test
sub r8, r11
lea rax, QWORD PTR [rcx+r11]
sub r10, r11
sub rdx, rcx
$LL8@test: ; Scalar loop (1 element iters)
movzx ecx, BYTE PTR [rax+r8]
lea rax, QWORD PTR [rax+1]
not cl
and cl, BYTE PTR [rax-1]
mov BYTE PTR [rax+r10-1], cl
sub rdx, 1
jne SHORT $LL8@test
$LN3@test:
add rsp, 8
ret 0

17.7:

$LL4@test: ; 1st vector loop (64 element iters)
movdqu xmm0, XMMWORD PTR [rcx-16]
add edx, 64 ; 00000040H
movdqu xmm1, XMMWORD PTR [r8+rcx-16]
movdqu xmm2, XMMWORD PTR [r8+rcx]
lea rcx, QWORD PTR [rcx+64]
andnps xmm1, xmm3
lea rax, QWORD PTR [rsi+rcx]
andps xmm1, xmm0
andnps xmm2, xmm3
movdqu xmm0, XMMWORD PTR [rcx-64]
movdqu XMMWORD PTR [r9+rcx-80], xmm1
andps xmm2, xmm0
movdqu xmm0, XMMWORD PTR [rcx-48]
movdqu xmm1, XMMWORD PTR [r8+rcx-48]
movdqu XMMWORD PTR [r9+rcx-64], xmm2
andnps xmm1, xmm3
andps xmm1, xmm0
movdqu xmm2, XMMWORD PTR [r8+rcx-32]
movdqu xmm0, XMMWORD PTR [rcx-32]
andnps xmm2, xmm3
andps xmm2, xmm0
movdqu XMMWORD PTR [r9+rcx-48], xmm1
movdqu XMMWORD PTR [r9+rcx-32], xmm2
cmp rax, rbp
jl SHORT $LL4@test
mov rbp, QWORD PTR [rsp+16]
mov eax, edi
and al, 63 ; 0000003fH
cmp al, 8
jb SHORT $LN27@test
$LN11@test:
mov ecx, edi
movsxd rax, edx
and ecx, -8
movsxd rsi, ecx
lea rcx, QWORD PTR [rax+r10]
$LL10@test: ; 2nd vector loop (8 element iters)
movq xmm1, QWORD PTR [r8+rcx]
add edx, 8
movq xmm0, QWORD PTR [rcx]
andnps xmm1, xmm3
andps xmm1, xmm0
movq QWORD PTR [r9+rcx], xmm1
add rcx, 8
mov rax, rcx
sub rax, r10
cmp rax, rsi
jl SHORT $LL10@test
$LN27@test:
mov rsi, QWORD PTR [rsp+24]
$LN9@test:
movsxd rcx, edx
mov rdx, rdi
mov rdi, QWORD PTR [rsp+32]
cmp rcx, rdx
jge SHORT $LN3@test
sub r11, r10
lea rax, QWORD PTR [rcx+r10]
sub rbx, r10
sub rdx, rcx
npad 6
$LL8@test: ; Scalar loop (1 element iters)
movzx ecx, BYTE PTR [rax+r11]
lea rax, QWORD PTR [rax+1]
not cl
and cl, BYTE PTR [rax-1]
mov BYTE PTR [rax+rbx-1], cl
sub rdx, 1
jne SHORT $LL8@test
$LN3@test:
pop rbx
ret 0
ARM64 ASM:
17.4 (was not vectorized):

|$LL4@test|
ldrsb w9,[x10,x1]
sub w3,w3,#1
ldrsb w8,[x1]
bic w8,w8,w9
strb w8,[x11,x1]
add x1,x1,#1
cbnz w3,|$LL4@test|
|$LN3@test|
ret

17.7:

|$LL27@test| ; 1st vector loop (64 element iters)
ldr q17,[x2,w8 sxtw #0]
sxtw x11,w8
ldr q16,[x1,w8 sxtw #0]
add x10,x11,#0x10
bic v16.16b,v16.16b,v17.16b
ldr q17,[x10,x2]
str q16,[x0,w8 sxtw #0]
ldr q16,[x10,x1]
add w8,w8,#0x40
cmp w8,w9
bic v16.16b,v16.16b,v17.16b
str q16,[x10,x0]
add x10,x11,#0x20
ldr q17,[x10,x2]
ldr q16,[x10,x1]
bic v16.16b,v16.16b,v17.16b
str q16,[x10,x0]
add x10,x11,#0x30
ldr q17,[x10,x2]
ldr q16,[x10,x1]
bic v16.16b,v16.16b,v17.16b
str q16,[x10,x0]
blt |$LL27@test|
and w9,w3,#0x3F
cmp w9,#8
blo |$LN30@test|
|$LN11@test|
and w9,w3,#0xFFFFFFF8
|$LL29@test| ; 2nd vector loop (8 element iters)
ldr d17,[x2,w8 sxtw #0]
ldr d16,[x1,w8 sxtw #0]
bic v16.8b,v16.8b,v17.8b
str d16,[x0,w8 sxtw #0]
add w8,w8,#8
cmp w8,w9
blt |$LL29@test|
|$LN30@test|
cmp w8,w3
bge |$LN32@test|
|$LN21@test|
add x9,x1,w8,sxtw #0
sub x13,x2,x1
sub w8,w3,w8
sub x10,x0,x1
|$LL31@test| ; Scalar loop (1 element iters)
ldrsb w12,[x13,x9]
sub w8,w8,#1
ldrsb w11,[x9]
bic w11,w11,w12
strb w11,[x10,x9]
add x9,x9,#1
cbnz w8,|$LL31@test|
|$LN32@test|
ret
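Conceptually, the same iteration-space split is applied on both targets. Here is a scalar sketch of that decomposition (test_decomposed is an illustrative name; the real first two loops operate on full and half-width vector registers rather than one byte at a time):

#include <cstdint>

void test_decomposed(int8_t* __restrict d, int8_t* __restrict a,
                     int8_t* __restrict b, int n) {
    int i = 0;
    for (; i < (n & ~63); i++)   // main vector loop: 16-byte vectors, unrolled by four
        d[i] = a[i] & (~b[i]);
    for (; i < (n & ~7); i++)    // remainder vector loop: 8 bytes in half a vector register
        d[i] = a[i] & (~b[i]);
    for (; i < n; i++)           // scalar tail: one element per iteration
        d[i] = a[i] & (~b[i]);
}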
Conditional Moves/Selects
MSVC now considers more opportunities to use conditional move instructions to eliminate branches. Once such an opportunity is recognized, whether to use a conditional move or leave the branch intact is subject to legality checks and tuning heuristics. Eliminating a branch is often beneficial because it removes the possibility of the branch being mispredicted and may enable other optimizations that previously would have stopped at the block boundary formed by the branch. Nevertheless, it does convert a control dependence that could be correctly predicted into a data dependence that cannot be removed.
This optimization applies only when the compiler can prove that it is safe to perform the store unconditionally, because the transformed instruction always stores one of two possible values rather than storing only when the condition is true. There are numerous reasons why an unconditional store may be unsafe. For example, if the memory location is shared, another thread may observe a store where previously there was no store to observe. As another example, the condition may have guarded a store through a potentially null pointer.
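For contrast with the example below, here are sketches of the two hazards just described; the functions are hypothetical and written only to illustrate why the branch must be kept:

// 1) Shared memory: an unconditional store could be observed by another
//    thread in executions where the original program never stored.
void update_shared(int* shared, int a) {
    if (*shared < a) {
        *shared = a;    // must remain conditional
    }
}

// 2) Potentially null pointer: the branch guards the dereference itself, so
//    an unconditional store could fault.
void update_guarded(int* p, int a) {
    if (p != nullptr) {
        *p = a;
    }
}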
The optimization is enabled for X64 as well as ARM64. (ARM64 conditional execution was discussed in a previous blog post.)
Example C++ source code:
int test(int a, int i) {
    int mem[4]{0};
    if (mem[i] < a) {
        mem[i] = a;
    }
    return mem[0];
}
Required compiler flags: /O2 (for the sake of simplifying the example output, /GS- was also passed)
X64 ASM:
17.4:

$LN6:
sub rsp, 24
movsxd rax, edx
xorps xmm0, xmm0
movups XMMWORD PTR mem$[rsp], xmm0
cmp DWORD PTR mem$[rsp+rax*4], ecx
jge SHORT $LN4@test
mov DWORD PTR mem$[rsp+rax*4], ecx
$LN4@test:
mov eax, DWORD PTR mem$[rsp]
add rsp, 24
ret 0

17.7:

$LN5:
sub rsp, 24
movsxd rax, edx
xorps xmm0, xmm0
movups XMMWORD PTR mem$[rsp], xmm0
lea rdx, QWORD PTR mem$[rsp]
cmp DWORD PTR [rdx+rax*4], ecx
lea rdx, QWORD PTR [rdx+rax*4]
cmovge ecx, DWORD PTR [rdx] ; cond
mov DWORD PTR [rdx], ecx
mov eax, DWORD PTR mem$[rsp]
add rsp, 24
ret 0
ARM64 ASM:
17.4:

|$LN5|
sub sp,sp,#0x10
movi v16.16b,#0
mov x9,sp
str q16,[sp]
ldr w8,[x9,w1 sxtw #2]
cmp w8,w0
bge |$LN2@test|
str w0,[x9,w1 sxtw #2]
|$LN2@test|
ldr w0,[sp]
add sp,sp,#0x10
ret

17.7:

|$LN9|
sub sp,sp,#0x10
movi v16.16b,#0
mov x9,sp
str q16,[sp]
ldr w8,[x9,w1 sxtw #2]
cmp w8,w0
cselge w8,w8,w0
str w8,[x9,w1 sxtw #2]
ldr w0,[sp]
add sp,sp,#0x10
ret
Loop Optimizations for Array Assignments and Copies
The optimizer now recognizes more cases where an assignment or copy to contiguous memory can be raised to memset/memcpy intrinsics and then lowered to more efficient instruction sequences. For example, consider the following array assignment in a count-down loop.
Example C++ source code:
char c[1024];
for (int n = 1023; n; n--)
    c[n] = 1;
Previously, the generated code was a loop of individual byte assignments. Now, more efficient wider assignments are performed, either through a call to a block-set routine such as memset or with vector instructions.
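In effect, the loop is recognized as a block set of the 1023 bytes starting at c + 1, since the count-down never touches c[0]. A hand-written equivalent would look like the following sketch (the fill function is illustrative only):

#include <cstring>

void fill() {
    char c[1024];
    std::memset(c + 1, 1, 1023);   // same effect as the count-down byte loop above
    // ... c is used here ...
}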
X64 ASM:
17.4:

mov QWORD PTR __$ArrayPad$[rsp], rax
mov eax, 1023
npad 2
$LL4@copy_while:
mov BYTE PTR c$[rsp+rax], 1
sub rax, 1
jne SHORT $LL4@copy_while

17.7:

vmovaps zmm0, ZMMWORD PTR __zmm@01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101
vmovq rdx, xmm0
lea rax, QWORD PTR c$[rsp+1]
mov ecx, 7
npad 14
$LL11@copy_while:
vmovups ZMMWORD PTR [rax], zmm0
vmovups YMMWORD PTR [rax+64], ymm0
vmovups XMMWORD PTR [rax+96], xmm0
lea rax, QWORD PTR [rax+128]
vmovups XMMWORD PTR [rax-16], xmm0
sub rcx, 1
jne SHORT $LL11@copy_while
vmovups ZMMWORD PTR [rax], zmm0
vmovups YMMWORD PTR [rax+64], ymm0
vmovups XMMWORD PTR [rax+96], xmm0
mov QWORD PTR [rax+112], rdx
mov DWORD PTR [rax+120], edx
mov WORD PTR [rax+124], dx
mov BYTE PTR [rax+126], dl
ARM64 ASM:
17.4:

mov x8,#0x3FF
mov x9,sp
mov w10,#1
|$LL4@copy_while|
strb w10,[x8,x9]
sub x8,x8,#1
cbnz x8,|$LL4@copy_while|

17.7:

add x0,sp,#1
mov x2,#0x3FF
mov w1,#1
bl memset
Summary
We would like to thank everyone for giving us feedback, and we look forward to hearing more from you. Please share your thoughts and comments with us through Developer Community. You can also reach us on Twitter (@VisualC) or via email at visualcpp@microsoft.com.