MSVC ARM64 optimizations in Visual Studio 2022 17.8
Visual Studio 2022 17.8 was released recently (download it here). While the blog “Visual Studio 17.8 now available!” already covers the new features and improvements, in this post we would like to share more details about what is new in the MSVC ARM64 backend. Over the last couple of months, we have improved code generation in the auto-vectorizer so that it emits Neon instructions in more cases. We have also optimized instruction selection for several scalar code-generation scenarios, for example short-circuit evaluation, comparison against an immediate, and smarter immediate splitting for logical instructions.
Auto-Vectorizer supports conversions between floating-point and integer
The following conversions between floating-point and integer types are common in real-world code. Now, they are all enabled in the ARM64 backend and hooked up with the auto-vectorizer.
| From | To | Instruction |
| --- | --- | --- |
| double | float | fcvtn |
| double | int64_t | fcvtzs |
| double | uint64_t | fcvtzu |
| float | double | fcvtl |
| float | int32_t | fcvtzs |
| float | uint32_t | fcvtzu |
| int64_t | double | scvtf |
| uint64_t | double | ucvtf |
| int32_t | float | scvtf |
| uint32_t | float | ucvtf |
For example:
void test (double * __restrict a, unsigned long long * __restrict b)
{
    for (int i = 0; i < 2; i++)
    {
        a[i] = (double)b[i];
    }
}
In Visual Studio 2022 17.7, the generated code was the following; because scalar instructions were used, both compute throughput and load/store bandwidth utilization were suboptimal.
ldp x9, x8, [x1]
ucvtf d17, x9
ucvtf d16, x8
stp d17, d16, [x0]
In Visual Studio 2022 17.8.2, the code-generation has been optimized into:
ldr q16, [x1]
ucvtf v16.2d, v16.2d
str q16, [x0]
Now a single Q-register load and store pair plus one SIMD conversion instruction are used.
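If you would rather express the vector form directly instead of relying on the auto-vectorizer, the same conversion can be written with Neon intrinsics. The following is only a sketch, assuming the standard ACLE names from <arm_neon.h>; the function name is made up for illustration.

#include <arm_neon.h>
#include <stdint.h>

// Illustrative only: the explicit intrinsics form of the uint64 -> double loop above.
void test_intrinsics (double * __restrict a, uint64_t * __restrict b)
{
    uint64x2_t v = vld1q_u64(b);       // one Q-register load of two 64-bit integers
    float64x2_t d = vcvtq_f64_u64(v);  // ucvtf on the .2d vector form
    vst1q_f64(a, d);                   // one Q-register store of two doubles
}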
That example converts between double and a 64-bit integer, so both types have the same size. There was another issue in the ARM64 backend that prevented auto-vectorization of conversions between differently sized types, and it has been fixed as well. MSVC now also auto-vectorizes the following example:
void test_df_to_sf (float * __restrict a, double * __restrict b, int * __restrict c)
{
    for (int i = 0; i < 4; i++)
    {
        a[i] = (float) b[i];
        c[i] = ((int)a[i]) << 5;
    }
}
The code-generation in Visual Studio 2022 17.7 was:
ldp d17, d16, [x1]
fcvt s17, d17
fcvt s16, d16
fcvtzs w8, s17
stp s17, s16, [x0]
lsl w9, w8, #5
fcvtzs w8, s16
lsl w8, w8, #5
stp w9, w8, [x2]
Scalar instructions plus loop unrolling were employed. In Visual Studio 2022 17.8.2, the loop is vectorized:
ldr q16, [x1]
fcvtn v16.2s, v16.2d
str d16, [x0]
fcvtzs v16.2s, v16.2s
shl v16.2s, v16.2s, #5
str d16, [x2]
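The mixed-width case can be mirrored with intrinsics as well. Again, this is only a sketch under the same assumptions (ACLE names from <arm_neon.h>, illustrative function name); it covers two elements, matching the snippet above.

#include <arm_neon.h>

// Illustrative only: intrinsics mirroring the vectorized loop body above (two elements).
void test_df_to_sf_intrinsics (float * __restrict a, double * __restrict b, int * __restrict c)
{
    float64x2_t d = vld1q_f64(b);      // load two doubles
    float32x2_t s = vcvt_f32_f64(d);   // fcvtn: narrow double -> float
    vst1_f32(a, s);                    // store two floats
    int32x2_t i = vcvt_s32_f32(s);     // fcvtzs on the .2s vector form
    i = vshl_n_s32(i, 5);              // shl by 5
    vst1_s32(c, i);                    // store two ints
}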
Auto-vectorizer now supports extended left shifts
Extended left shifts are also common in real-world code, so the ARM64 ISA has native instructions to support them. Neon has SSHLL and USHLL for signed and unsigned extended left shifts: they widen the shift source first, then shift the widened value. Let’s have a look at the following example:
void test_signed (signed short * __restrict a, signed char * __restrict b)
{
    for (int i = 0; i < 8; i++)
        a[i] = b[i] << 7;
}

void test_unsigned (unsigned short * __restrict a, unsigned char * __restrict b)
{
    for (int i = 0; i < 8; i++)
        a[i] = b[i] << 7;
}
The code-generation in Visual Studio 2022 17.7 was:
|test_signed| PROC
ldr d16, [x1]
sxtl v16.8h, v16.8b
shl v16.8h, v16.8h, #7
str q16, [x0]
|test_unsigned| PROC
ldr d16, [x1]
ushll v16.8h, v16.8b, #0
shl v16.8h, v16.8h, #7
str q16, [x0]
The loops are vectorized, but a separate type promotion is done first, followed by a normal shift; the two operations can be combined into a single extended shift. We have taught the ARM64 backend about this, and Visual Studio 2022 17.8.2 now generates:
|test_signed| PROC
ldr d16, [x1]
sshll v16.8h, v16.8b, #7
str q16, [x0]
|test_unsigned| PROC
ldr d16, [x1]
ushll v16.8h, v16.8b, #7
str q16, [x0]
A single SSHLL or USHLL is used. However, SSHLL and USHLL only encode shift amounts within [0, bit_size_of_shift_source - 1]; for the above test cases the shift amount can only be in [0, 7]. Therefore, we cannot use either instruction if we want to move the element into the upper half of the wider destination, because the shift amount would then be 8, which is outside the encoding range. For this case signedness does not matter, and ARM64 Neon has SHLL (Shift Left Long, by element size) to support it.
Let us increase the shift amount to the element size of the shift source, which is 8:
void test_signed(signed short * __restrict a, signed char * __restrict b)
{
    for (int i = 0; i < 8; i++)
        a[i] = b[i] << 8;
}

void test_unsigned(unsigned short * __restrict a, unsigned char * __restrict b)
{
    for (int i = 0; i < 8; i++)
        a[i] = b[i] << 8;
}
The code-generation in Visual Studio 2022 17.7 was:
|test_signed| PROC
ldr d16, [x1]
sxtl v16.8h, v16.8b
shl v16.8h, v16.8h, #8
str q16, [x0]
|test_unsigned| PROC
ldr d16, [x1]
ushll v16.8h, v16.8b, #0
shl v16.8h, v16.8h, #8
str q16, [x0]
And in Visual Studio 2022 17.8.2, it becomes:
|test_signed| PROC
ldr d16, [x1]
shll v16.8h, v16.8b, #8
str q16, [x0]
|test_unsigned| PROC
ldr d16, [x1]
shll v16.8h, v16.8b, #8
str q16, [x0]
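For completeness, the extended left shift can also be spelled out with intrinsics. The sketch below assumes the standard ACLE names from <arm_neon.h> and uses the shift-by-7 case; the function names are illustrative. Shifting by the full element width (8) corresponds to SHLL, as shown in the listing above.

#include <arm_neon.h>
#include <stdint.h>

// Illustrative only: widen 8 signed bytes to 16 bits and shift left by 7 (sshll).
void test_signed_intrinsics (int16_t * __restrict a, const int8_t * __restrict b)
{
    int8x8_t src = vld1_s8(b);            // load 8 signed bytes
    int16x8_t res = vshll_n_s8(src, 7);   // sshll: widen, then shift left by 7
    vst1q_s16(a, res);                    // store 8 halfwords
}

// Illustrative only: the unsigned counterpart (ushll).
void test_unsigned_intrinsics (uint16_t * __restrict a, const uint8_t * __restrict b)
{
    uint8x8_t src = vld1_u8(b);           // load 8 unsigned bytes
    uint16x8_t res = vshll_n_u8(src, 7);  // ushll: widen, then shift left by 7
    vst1q_u16(a, res);                    // store 8 halfwords
}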
Scalar code-generation improved on immediate materialization for CMP/CMN
On the blog introducing ARM64 optimizations in Visual Studio 2022 17.6, we received feedback about unnecessary immediate materialization for integer comparisons. In the 17.8 development cycle, we have improved a couple of relevant places in the ARM64 backend.
For integer comparisons, the ARM64 backend is now smarter: it knows that an immediate which does not fit the encoding can sometimes be adjusted to immediate + 1 or immediate - 1 so that it does, with the comparison condition adjusted accordingly. Here are some examples:
int test_ge2gt (int a)
{
    if (a >= 0x10001)
        return 1;
    else
        return 0;
}

int test_lt2le (int a)
{
    if (a < -0x1fff)
        return 1;
    else
        return 0;
}
0x10001 in test_ge2gt does not fit into the immediate encoding of the ARM64 CMP instruction, either verbatim or shifted. However, if we subtract 1 from it and turn greater-or-equal (ge) into greater-than (gt) accordingly, then 0x10000 fits into the shifted encoding.
For test_lt2le, the negative immediate -0x1fff does not fit into the immediate encoding of the ARM64 CMN instruction, but if we subtract 1 from it and turn less-than (lt) into less-or-equal (le) accordingly, then -0x2000 fits into the shifted encoding.
So, the code generation in Visual Studio 2022 17.7 was the following:
|test_ge2gt| PROC
mov w8, #0x10001
cmp w0, w8
csetge w0
|test_lt2le| PROC
mov w8, #-0x1FFF
cmp w0, w8
csetlt w0
There is an extra MOV instruction to materialize the immediate into a register because it does not fit into the encoding verbatim. After the above-mentioned improvements, Visual Studio 2022 17.8.2 generates:
|test_ge2gt| PROC
cmp w0, #0x10, lsl #0xC
csetgt w0
|test_lt2le| PROC
cmn w0, #2, lsl #0xC
csetle w0
The sequence is more efficient.
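As a quick sanity check of these rewrites (a sketch for illustration, not compiler code), the adjusted comparisons are equivalent to the originals for every 32-bit int:

#include <assert.h>

// Illustrative check: the condition adjustments preserve the original semantics.
void check_rewrites (int a)
{
    assert((a >= 0x10001) == (a > 0x10000));   // ge -> gt after subtracting 1
    assert((a < -0x1fff) == (a <= -0x2000));   // lt -> le after subtracting 1
}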
Scalar code-generation improved on logic immediate loading
We have also taken further steps to improve immediate handling for other instructions. One improvement: ARM64 has a rotated encoding for logical immediates (please refer to the description of DecodeBitMasks in the Arm Architecture Reference Manual for details), and this encoding is used by AND/ORR. If an immediate does not fit into the rotated encoding verbatim, it may fit after being split.
For example, programmers frequently write code patterns like the following:
#define FLAG1_MASK 0x80000000
#define FLAG2_MASK 0x00004000

int cal (int a)
{
    return a & (FLAG1_MASK | FLAG2_MASK);
}
The compiler middle-end usually folds the ANDed immediates into the return expression and returns a & 0x80004000, which does not fit into the rotated encoding; hence a MOV/MOVK sequence is generated to load the immediate, at a total cost of three instructions. The code generation in Visual Studio 2022 17.7 was:
mov w8, #0x4000
movk w8, #0x8000, lsl #0x10
and w0, w0, w8
If we split 0x80004000 into 0xffffc000 and 0x80007fff, ANDing them sequentially has the same effect as ANDing 0x80004000, but both 0xffffc000 and 0x80007fff fit into the rotated encoding, so we save one instruction. The code generation in Visual Studio 2022 17.8.2 is:
and w8, w0, #0xFFFFC000
and w0, w8, #0x80007FFF
The immediate gets split in a way that the two parts fit into two AND instructions. We only want to split the immediate when it has a single use site, otherwise we would end up with duplicated sequences; therefore, the split is guided by use count.
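As a quick check of the split (a sketch for illustration), ANDing the two encodable masks in sequence is equivalent to ANDing the original immediate:

#include <assert.h>
#include <stdint.h>

// Illustrative check: (a & 0xFFFFC000) & 0x80007FFF == a & 0x80004000 for any 32-bit value.
void check_split (uint32_t a)
{
    assert(((a & 0xFFFFC000u) & 0x80007FFFu) == (a & 0x80004000u));
}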
Scalar code-generation now catches more CCMP opportunities
The CCMP (conditional compare) instruction is useful for accelerating short-circuit evaluation, for example:
int test (int a)
{
    return a == 17 || a == 31;
}
For this test case, Visual Studio 2022 17.7 was already smart enough to employ CCMP and generated:
cmp w0, #0x11
ccmpne w0, #0x1F, #4
cseteq w0
However, CCMP only takes a 5-bit immediate, so if we change the test case to:
int test (int a)
{
    return a == 17 || a == 32;
}
The immediate #32 does not fit into CCMP’s encoding, so the compiler stopped generating it; hence the code generation in Visual Studio 2022 17.7 was:
cmp w0, #0x11
beq |$LN3@test|
cmp w0, #0x20
mov w0, #0
bne |$LN4@test|
|$LN3@test|
mov w0, #1
|$LN4@test|
ret
It employs an if-then-else structure and is verbose. Here, the compiler should have a better cost model and know that it is still beneficial to move #32 into a register and employ CCMP’s register form. We have fixed this in Visual Studio 2022 17.8.2, and the code generation becomes:
cmp w0, #0x11
mov w8, #0x20
ccmpne w0, w8, #4
cseteq w0
ret
Using MOVI/MVNI for immediate move in smaller loops
We had missed an opportunity to use shifted MOVI/MVNI for combining immediate moves in small loops. For example:
void movi_msl (int *__restrict a, int *__restrict b, int *__restrict c) {
    for (int i = 0; i < 8; i++)
        a[i] = 0x1200;
    for (int i = 0; i < 8; i++)
        c[i] = 0x12ffffff;
}
In the 17.7 release, MSVC generated:
|movi_msl| PROC
mov x9, #0x1200
movk x9, #0x1200, lsl #0x20
ldr x8, |$LN29@movi_msl|
stp x9, x9, [x0]
stp x8, x8, [x2]
stp x9, x9, [x0, #0x10]
stp x8, x8, [x2, #0x10]
Scalar MOV/MOVK instructions are employed, and 8 iterations are needed to initialize each array. Both immediates can be loaded into vector registers using MOVI/MVNI, increasing the effective store bandwidth and reducing the number of iterations.
Shifted MOVI shifts the immediate to the left and fills with zeros, so 0x1200 can be loaded as follows:
0x12 << 8 = 0x1200
Shifted MVNI shifts the immediate first, then inverts the result:
~((0xED) << 0x18) = 0x12ffffff
In the 17.8 release, MSVC generates:
|movi_msl| PROC
movi v17.4s, #0x12, lsl #8
mvni v16.4s, #0xED, lsl #0x18
stp q17, q17, [x0]
stp q16, q16, [x2]
Benefiting from the vector registers’ width and the use of paired stores, only one iteration is needed for each array.
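If you want the splats to be explicit in source, the initialization can also be written with intrinsics. This is only a sketch assuming the ACLE names from <arm_neon.h> and an illustrative function name; with literal arguments the compiler is still free to materialize the constants via MOVI/MVNI as shown above.

#include <arm_neon.h>
#include <stdint.h>

// Illustrative only: splat the two constants explicitly and store 8 ints per array.
void movi_msl_intrinsics (int32_t * __restrict a, int32_t * __restrict c)
{
    int32x4_t v1 = vdupq_n_s32(0x1200);      // candidate for movi #0x12, lsl #8
    int32x4_t v2 = vdupq_n_s32(0x12ffffff);  // candidate for mvni #0xED, lsl #0x18
    vst1q_s32(a, v1);                        // first 4 ints of a
    vst1q_s32(a + 4, v1);                    // last 4 ints of a
    vst1q_s32(c, v2);                        // first 4 ints of c
    vst1q_s32(c + 4, v2);                    // last 4 ints of c
}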
In closing
That is all for this blog; your feedback is valuable to us. Please share your thoughts and comments with us through the Visual C++ Developer Community. You can also reach us on Twitter (@VisualC) or via email at visualcpp@microsoft.com.