if (P0) R0 = memw(R2) // conditionally load word if P0
if (!P1) jump label // conditionally jump if not P1
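3.2.2 Hardware loops
Hardware loops execute loop branches with zero overhead. There are two hardware loops, loop0 and loop1, and they can be nested.
For a non-nested loop, use loop0.
For nested loops, use loop0 for the inner loop and loop1 for the outer loop.
If loops are nested more than two deep, the two innermost loops use hardware loops and the outer loops use software branches.
SA0 and LC0 are used by loop0.
SA1 and LC1 are used by loop1.
endloop0 and endloop1 perform the loop test and branch at the end of the body, so a hardware loop behaves like a do ... while loop.
// A simple hardware loop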
loop0(start,#3) // loop 3 times : SA0=&start, LC0=3
start:
{ R0 = mpyi(R0,R0) } :endloop0
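The do ... while behaviour can be sketched in C. This is a simplified model, not the hardware definition: it assumes that endloop0 branches back to SA0 and decrements LC0 only while LC0 is greater than 1. The function name hw_loop0_trips is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of a loop0 ... :endloop0 pair.
 * Assumption: endloop0 jumps back and decrements LC0 while LC0 > 1,
 * so the body always runs at least once, like a do ... while loop. */
uint32_t hw_loop0_trips(uint32_t lc0)
{
    uint32_t trips = 0;
    for (;;) {
        trips++;            /* the loop body executes */
        if (lc0 > 1) {      /* endloop0: test and branch */
            lc0--;
        } else {
            break;          /* fall through to the next packet */
        }
    }
    return trips;
}
```

Under this model a count of 3 gives three trips, and a count of 0 still gives one trip, so a zero count has to be tested for in software.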
// Two nested hardware loops
// Sum each row of a 100x200 matrix.
loop1(outer_start,#100)
outer_start:
R0 = #0
loop0(inner_start,#200)
inner_start:
R3 = memw(R1++#4)
{ R0 = add(R0,R3) }:endloop0
{ memw(R2++#4) = R0 }:endloop1
//Using loopN:
R1 = #20000;
loop0(start,R1) // LC0=20000, SA0=&start
start:
{ R0 = mpyi(R0,R0) } :endloop0
// Using register transfers:
R1 = #20000
LC0 = R1 // LC0=20000
R1 = #start
SA0 = R1 // SA0=&start
start:
{ R0 = mpyi(R0,R0) } :endloop0
// Using 32-bit constants:
R1 = #20000;
loop0(##start,R1) // LC0=20000, SA0=&start
...
// Sum the rows of a 100x200 matrix.
// Software pipeline the outer loop.
p0 = cmp.gt(R0,R0) // p0 = false
loop1(outer_start,#100)
outer_start:
{
if (p0) memw(R2++#4) = R0 // save the sum of a row
p0 = cmp.eq(R0,R0) // p0 = true
R0 = #0
loop0(inner_start,#200)
}
inner_start:
R3 = memw(R1++#4)
{ R0 = add(R0,R3) }:endloop0:endloop1
memw(R2++#4) = R0 // save the sum of the last row
// Skipping the loop body when the count is zero
loop0(start,R1)
P0 = cmp.eq(R1,#0) // this line and the next execute before the loop body
if (P0) jump skip
start:
{ R0 = mpyi(R0,R0) } :endloop0
skip:
Software pipelined loop: the drawback is that it requires a prologue and an epilogue.
// C Code
int foo(int *A, int *result)
{
int i;
for (i=0;i<100;i++) {
result[i]= A[i]*A[i];
}
}
// ASM Code
foo:
{
R3 = R1
loop0(.kernel,#98) // Decrease loop count by 2
}
R1 = memw(R0++#4) // 1st prologue stage
{
R1 = memw(R0++#4) // 2nd prologue stage
R2 = mpyi(R1,R1)
}
.falign
.kernel:
{
// Three-stage pipeline: load one, multiply one, store one
R1 = memw(R0++#4) // The load for iteration N+2
R2 = mpyi(R1,R1) // The multiply for iteration N+1
memw(R3++#4) = R2 // The store for iteration N
}:endloop0
{
R2 = mpyi(R1,R1) // 1st epilogue stage
memw(R3++#4) = R2
}
memw(R3++#4) = R2 // 2nd epilogue stage
jumpr lr
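The prologue, kernel and epilogue above can be mirrored in plain C to check the staging. This is a sketch that assumes n is at least 2; the name square_pipelined is hypothetical, and the operations of one packet are written sequentially in an order that preserves the "read old values" behaviour.

```c
#include <stddef.h>

/* Sketch of the three-stage pipeline (load, multiply, store).
 * Assumes n >= 2, matching the assembly where the count is 100. */
void square_pipelined(const int *a, int *result, size_t n)
{
    size_t ld = 0, st = 0;
    int loaded, product;

    loaded = a[ld++];               /* 1st prologue stage: load */
    product = loaded * loaded;      /* 2nd prologue stage: multiply ... */
    loaded = a[ld++];               /* ... and load the next element */

    for (size_t i = 0; i < n - 2; i++) {   /* kernel runs n - 2 times */
        result[st++] = product;     /* store for iteration N */
        product = loaded * loaded;  /* multiply for iteration N+1 */
        loaded = a[ld++];           /* load for iteration N+2 */
    }

    result[st++] = product;         /* 1st epilogue stage: store ... */
    product = loaded * loaded;      /* ... and the last multiply */
    result[st++] = product;         /* 2nd epilogue stage: last store */
}
```

The kernel count is n - 2 because two loads and one multiply are peeled into the prologue, which is the same reason the assembly uses #98 for 100 elements.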
Software pipelined loop (using spNloop0)
int foo(int *A, int *result)
{
int i;
for (i=0;i<100;i++) {
result[i]= A[i]*A[i];
}
}
foo:
{ // load safety assumed
// When the spNloop0 instruction is executed,
// it assigns the truth value false to the predicate register P3.
// After the associated loop has executed N times (N is the N in spNloop0),
// P3 is automatically set to true.
P3 = sp2loop0(.kernel,#102) // set up pipelined loop
R3 = R1
}
.falign
.kernel:
{
R1 = memw(R0++#4) // kernel
R2 = mpyi(R1,R1)
if (P3) memw(R3++#4) = R2
}:endloop0
jumpr lr
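The way P3 suppresses the first two stores can be modelled in C. This is a sketch under stated assumptions: the input has n + 2 readable elements (the "load safety assumed" condition), P3 becomes true once the kernel packet has executed twice, and registers start at zero. The name square_sp2loop is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the sp2loop0 version. Assumptions: `a` has n + 2 readable
 * elements, and P3 becomes true after the kernel has executed twice. */
void square_sp2loop(const int *a, int *result, size_t n)
{
    bool p3 = false;          /* cleared by sp2loop0 */
    int r1 = 0, r2 = 0;       /* undefined in hardware; zeroed here */
    size_t st = 0;

    for (size_t it = 0; it < n + 2; it++) {
        /* All three operations read the register values from before
         * the packet, so compute the new values first. */
        int new_r1 = a[it];
        int new_r2 = r1 * r1;
        if (p3) {
            result[st++] = r2;   /* store is skipped while P3 is false */
        }
        r1 = new_r1;
        r2 = new_r2;
        if (it + 1 >= 2) {
            p3 = true;           /* set automatically after 2 iterations */
        }
    }
}
```

The loop runs n + 2 times (102 for 100 elements) because the first two iterations only fill the pipeline and store nothing.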
3.2.3 Software branches
jump end
jumpr R1
call function // call subroutine
callr R2
- Return
jumpr R31 // subroutine return; R31 is the LR (Link Register)
- Extended branches
jump ##label // 32-bit offset
call #label // non 32-bit offset
jump label // offset size determined by assembler
3.2.4 Speculative jumps
A conditional jump that uses a .new predicate is called a speculative jump.
{
P0 = cmp.eq(R9,#16) // single-packet compare-and-jump
if (P0.new) jumpr:t R11 // ... enabled by use of P0.new
// the direction hint is mandatory
// jump:t – The jump instruction will most often be taken
// jump:nt – The jump instruction will most often be not taken
// a correct hint can improve performance
}
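To write efficient code without resorting to assembly language, the C compiler supports intrinsics, which express Hexagon processor instructions directly in C code. Intrinsics are defined for most Hexagon processor instructions.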
#include <hexagon_protos.h> // intrinsic prototypes from the Hexagon tools
int main()
{
long long v1 = 0xFFFF0000FFFF0000;
long long v2 = 0x0000FFFF0000FFFF;
long long result;
// find the minimum for each half-word in 64-bit vector
result = Q6_P_vminh_PP(v1,v2);
}
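To make the result of the Q6_P_vminh_PP example easier to check on a host machine, the operation can be modelled in portable C. This is a reference sketch of a per-halfword signed minimum, not the intrinsic itself; the name vminh_ref is hypothetical, and the casts assume two's-complement conversion.

```c
#include <stdint.h>

/* Portable reference for a per-halfword signed minimum over a 64-bit
 * vector. This is a model of the operation, not the intrinsic itself. */
uint64_t vminh_ref(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        /* Extract one 16-bit lane from each input as a signed value. */
        int16_t x = (int16_t)(uint16_t)(a >> (16 * lane));
        int16_t y = (int16_t)(uint16_t)(b >> (16 * lane));
        int16_t m = (x < y) ? x : y;
        out |= (uint64_t)(uint16_t)m << (16 * lane);
    }
    return out;
}
```

For the two vectors in the example, every lane compares 0xFFFF (that is, -1) against 0x0000, so each lane of the model's result is 0xFFFF.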
3.7 Aliased registers
Alias of R29: SP (Stack pointer: points to the topmost element of the stack in memory.)
Alias of R30: FP (Frame pointer: points to the current procedure frame on the stack.)
Alias of R31: LR (Link register: stores the return address of a subroutine call.)
The values of these registers are modified implicitly by the subroutine and stack instructions.
SP = add(SP, #-8) // SP is alias of R29
allocframe // Modifies SP (R29) and FP (R30)
call init // Modifies LR (R31)
3.8 Modifier registers
3.8.1 Indirect auto-increment
The value in R0 is used first and incremented afterwards, similar to the post-incremented pointer or index of a C for loop:
M1 = R1 // Set modifier register
R2 = memw(R0++M1) // The effective addr is the value of R0.
// Next, M1 is added to R0 and the result
// is stored in R0.
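The effect is equivalent to the following code, except that the register form is more flexible because the increment can be changed at run time: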
R2 = memw(R3++#4) // R3 contains the effective address
// R3 is then incremented by 4
3.8.2 Circular buffers
3.8.2.1 Circular with auto-increment immediate
CS0 and CS1 hold the start address of the circular buffer.
Set up and access a 150-byte circular buffer:
R4.H = #0 // K = 0
R4.L = #150 // length = 150
M0 = R4
R2 = ##cbuf // start addr = cbuf
CS0 = R2
R0 = memb(R2++#4:circ(M0)) // Load byte from circ buf
// specified by M0/CS0
// inc R2 by 4 after load
// wrap R2 around if >= 150
The pointer update performed by the code above is equivalent to the following C code:
unsigned int fcircadd(unsigned int pointer, int offset,
unsigned int M_reg, unsigned int CS_reg)
{
unsigned int length;
int new_pointer, start_addr, end_addr;
length = (M_reg&0x01ffff); // lower 17-bits gives buffer size
new_pointer = pointer+offset;
start_addr = CS_reg;
end_addr = CS_reg + length;
if (new_pointer >= end_addr) {
new_pointer -= length;
} else if (new_pointer < start_addr) {
new_pointer += length;
}
return (new_pointer);
}
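1) The Length field of the Mx register is the size of the circular buffer in bytes; its range is 4 to (128K-1).
2) The K field of the Mx register is set to 0.
3) The CSx register is set to the start address of the circular buffer (CS0 pairs with M0, CS1 pairs with M1).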
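The wrap-around rule can be exercised with a small stand-alone helper. This is a sketch modelled on fcircadd, with the start address and length passed directly instead of being unpacked from Mx and CSx; the name circ_advance and the example addresses are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper modelled on fcircadd: advance `pointer` by
 * `offset` bytes inside the buffer [start, start + length). */
uint32_t circ_advance(uint32_t pointer, int32_t offset,
                      uint32_t start, uint32_t length)
{
    int64_t next = (int64_t)pointer + offset;
    int64_t end = (int64_t)start + length;
    if (next >= end) {
        next -= length;       /* stepped past the end: wrap to the front */
    } else if (next < (int64_t)start) {
        next += length;       /* stepped before the start: wrap to the back */
    }
    return (uint32_t)next;
}
```

With a 150-byte buffer at 0x1000, stepping by 4 from offset 148 lands on offset 2 rather than 152, and stepping by -4 from the start lands on offset 146.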
3.8.2.2 Circular with auto-increment register
R0 = memb(R2++I:circ(M1)) // load byte with incr of I*4 from
// circ buf specified by M1/CS1
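3.8.3 Bit-reversed
Bit-reversed addressing: the address is bit-reversed only at the moment it is used for the memory access. The address stored in the register keeps advancing by the normal rule and is never bit-reversed itself. The bit-reversal rule for a 32-bit address:
1) The low 16 bits are swapped: bit 0 with bit 15, bit 1 with bit 14, ...
2) The high 16 bits are unchanged.
Applications of bit reversal: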
1) FFT(Fast Fourier Transforms)
2) Viterbi Encoding
M1 = R7 // Set modifier register
R2 = memub(R0++M1:brev) // The address is (R0.H | bitrev(R0.L))
// The original R0 (not reversed) is added
// to M1 and written back to R0
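The address transformation described by the two rules can be written out in C. This is an illustrative model of the rule as stated in this section; the name brev_addr is hypothetical.

```c
#include <stdint.h>

/* Reverse the low 16 bits of an address and keep the high 16 bits. */
uint32_t brev_addr(uint32_t addr)
{
    uint32_t low = addr & 0xFFFFu;
    uint32_t rev = 0;
    for (int bit = 0; bit < 16; bit++) {
        if (low & (1u << bit)) {
            rev |= 1u << (15 - bit);   /* bit 0 <-> bit 15, bit 1 <-> bit 14, ... */
        }
    }
    return (addr & 0xFFFF0000u) | rev;
}
```

For example, the register value 0x12340001 accesses memory at 0x12348000, while the register itself continues to advance from 0x12340001.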
3.9 Predicate registers (hold the results of scalar and vector compare instructions)
P1 = cmp.eq(R2, R3) // Scalar compare
if (P1) jump end // Jump to address (conditional)
R8 = P1 // Get compare status (P1 only)
P3:0 = R4 // Set compare status (P0-P3)
Convergent rounding
These instructions work as follows:
Compute (A+B) or (A-B) for AVG and NAVG respectively.
Based on the two least-significant bits of the result, add a rounding constant as
follows:
If the two LSBs are 00, add 0
If the two LSBs are 01, add 0
If the two LSBs are 10, add 0
If the two LSBs are 11, add 1
Shift the result right by one bit.
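The steps above can be sketched in C for the AVG case. This is a model of the listed rule only; the name avg_crnd is hypothetical, and the final shift assumes that >> on a negative value is an arithmetic shift, which holds on common compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Model of the convergent-rounding AVG steps listed above. */
int32_t avg_crnd(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;   /* 64 bits so the sum cannot overflow */
    if ((sum & 3) == 3) {
        sum += 1;                   /* only LSBs 11 receive a rounding constant */
    }
    return (int32_t)(sum >> 1);     /* assumes arithmetic shift for negative sums */
}
```

Exact halves round to the nearest even value: (1 + 2) / 2 = 1.5 becomes 2, while (2 + 3) / 2 = 2.5 also becomes 2. For NAVG the same steps would apply to a - b instead of a + b.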
Scaling for divide and square-root
On Hexagon, floating-point divide and square root are implemented in software libraries.
R3 += sfmpy(R0,R1,P2):scale
/* Here is the recommended code sequence to acquire a mutex: */
// assume R1,R3,P0,P1 are scratch
lockMutex:
R3 = #1
lock_test_spin:
R1 = memw_locked(R0) // do normal test to wait
P1 = cmp.eq(R1,#0) // for lock to be available
if (!P1) jump lock_test_spin
memw_locked(R0,P0) = R3 // do store conditional (SC)
if (!P0) jump lock_test_spin // was LL and SC done atomically?
/* Here is the recommended code sequence to release a mutex: */
// assume mutex address is held in R0
// assume R1 is scratch
R1 = #0
memw(R0) = R1
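For comparison, the same acquire and release pattern can be expressed with C11 atomics. This is a portable analogue rather than a translation of the Hexagon sequence: a compare-and-exchange stands in for the load-locked / store-conditional pair, and the names lock_mutex and unlock_mutex are hypothetical.

```c
#include <stdatomic.h>

/* Portable analogue of the LL/SC mutex; 0 = free, 1 = held. */
void lock_mutex(atomic_int *m)
{
    for (;;) {
        int expected = 0;
        /* Succeeds only if *m was 0 and is swapped to 1 atomically,
         * like a load-locked / store-conditional pair that both pass. */
        if (atomic_compare_exchange_weak(m, &expected, 1)) {
            return;
        }
        /* Otherwise the lock was held or the exchange failed: retry. */
    }
}

void unlock_mutex(atomic_int *m)
{
    atomic_store(m, 0);   /* release by writing 0, as in the assembly */
}
```

The weak form of the exchange is allowed to fail spuriously, which is why it sits inside a retry loop, much like the jump back to lock_test_spin when P0 is false.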
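6. Conditional execution
Conditional execution on Hexagon is based on the values that compare instructions set in the four 8-bit predicate registers (P0-P3).
6.1 Scalar predicates
A scalar predicate is represented by an 8-bit value:
true: 0xFF
false: 0x00
6.1.1 Generating scalar predicates
6.1.2 Consuming scalar predicates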
if (P0) jump target // jump if P0 is true
if (!P2) R2 = R5 // assign register if !P2 is true
if (P1) R0 = sub(R2,R3) // conditionally subtract if P1
if (P2) R0 = memw(R2) // conditionally load word if P2
// mux selects either Rs or Rt based on the least significant bit in Ps.
// If the least-significant bit in Ps is a 1, then Rd is set to Rs,
// otherwise it is set to Rt.
Rd = mux(Ps,Rs,Rt)
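The mux behaviour described in the comment can be modelled in a few lines of C. This is a sketch of the stated rule only; the names mux_ref, PRED_TRUE and PRED_FALSE are hypothetical.

```c
#include <stdint.h>

#define PRED_TRUE  0xFFu   /* scalar predicate value for true */
#define PRED_FALSE 0x00u   /* scalar predicate value for false */

/* Model of Rd = mux(Ps,Rs,Rt): only the least-significant bit of Ps counts. */
uint32_t mux_ref(uint8_t ps, uint32_t rs, uint32_t rt)
{
    return (ps & 1u) ? rs : rt;
}
```

Because only bit 0 is examined, a predicate value such as 0x02 selects Rt even though it is non-zero.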
// Logical AND
{
P0 = cmp(A) // if A && B then jump
P0 = cmp(B)
if (P0.new) jump:t taken_path
}
// Logical NOT
Pd = !cmp.{eq,gt}(Rs, {#s10,Rt} )
Pd = !cmp.gtu(Rs, {#u9,Rt} )
Pd = !tstbit(Rs, {#u5,Rt} )
Pd = !bitsclr(Rs, {#u6,Rt} )
Pd = !bitsset(Rs,Rt)
// Using .new
//C statement
if (R2 == 4)
R3 = *R4;
else
R5 = 5;
//Assembly code
{
P0 = cmp.eq(R2,#4)
if (P0.new) R3 = memw(R4)
if (!P0.new) R5 = #5
}
// Using both the new and the previous predicate value
{
P0 = cmp.eq(R2,#4)
if (P0.new) R3 = memw(R4) // use newly-generated P0 value
if (P0) R5 = #5 // use previous P0 value
}
6.2 Vector predicates
6.2.1 Vector compare
6.2.2 Vector mux instruction
Each bit of Px selects the corresponding byte.
R1:0 = vmux(P0,R3:2,R5:4) // choose bytes from R3:2 if true
R1:0 = vmux(P0,R5:4,R3:2) // choose bytes from R3:2 if false
Using vector predicates
// Consider the following C statement:
for (i=0; i<8; i++) {
if (A[i]) {
B[i] = C[i];
}
}
// Assuming arrays of bytes, this code can be vectorized as follows:
R1:0 = memd(R_A) // R1:0 holds A[7]-A[0]
R3 = #0 // clear R3:2
R2 = #0
P0 = vcmpb.eq(R1:0,R3:2) // compare bytes in A to zero
R5:4 = memd(R_B) // R5:4 holds B[7]-B[0]
R7:6 = memd(R_C) // R7:6 holds C[7]-C[0]
R3:2 = vmux(P0,R5:4,R7:6) // P0 is true where A[i] == 0: keep B[i] there, else take C[i]
memd(R_B) = R3:2 // store B[7]-B[0]
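The compare-then-select idea can be checked on a host with a byte-wise C model. This is a sketch with hypothetical names (vcmpb_eq_ref, vmux_ref, cond_copy8). Note the operand order: since the predicate is true where A[i] is zero, the bytes of B go on the true side of the select and the bytes of C on the false side.

```c
#include <stdint.h>

/* Model of P = vcmpb.eq(a, b): bit i is set when byte i matches. */
uint8_t vcmpb_eq_ref(uint64_t a, uint64_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (((a >> (8 * i)) & 0xFF) == ((b >> (8 * i)) & 0xFF)) {
            p |= (uint8_t)(1u << i);
        }
    }
    return p;
}

/* Model of vmux(P, s, t): byte i comes from s when bit i of P is set. */
uint64_t vmux_ref(uint8_t p, uint64_t s, uint64_t t)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t src = (p & (1u << i)) ? s : t;
        out |= src & (0xFFull << (8 * i));
    }
    return out;
}

/* if (A[i]) B[i] = C[i], applied to eight bytes at once. */
uint64_t cond_copy8(uint64_t a, uint64_t b, uint64_t c)
{
    uint8_t a_is_zero = vcmpb_eq_ref(a, 0);
    return vmux_ref(a_is_zero, b, c);   /* keep B where A[i] == 0 */
}
```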
Vector compute instructions
■ .b for signed byte
■ .ub for unsigned byte
■ .h for signed halfword
■ .uh for unsigned halfword
■ .w for signed word
■ .uw for unsigned word
v0.b = vadd(v1.b,v2.b) // Add vectors of bytes
v1:0.b = vadd(v3:2.b, v5:4.b) // Add vector pairs of bytes
v1:0.h = vadd(v3:2.h, v5:4.h) // Add vector pairs of halfwords
v5:4.w = vmpy(v0.h,v1.h) // Widening vector 16x16 to 32
// multiplies: halfword inputs,
// word outputs
8.2 Registers
32 512-bit data registers: V0 ... V31
V1 = vmem(R0) // load 512 bits of data
// from address R0
V4.w = vadd(V2.w, V3.w) // add each word in V2
// to corresponding word in V3
V5:4.w = vadd(V3:2.w, V1:0.w) // add each word in V1:0 to
// corresponding word in V3:2
HVX works with the L2 cache or the L2 TCM. For data accessed through the L2 cache, the memory must be marked L2-cacheable; data located in the L2 TCM must be marked uncached.
VMEM(R0) = V1; // Store to R0 & ~(0x3F)
V0 = VMEMU(R0); // Load a vector of bytes starting at R0
// regardless of alignment
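The difference between the aligned and unaligned forms comes down to the low six address bits. The sketch below assumes 64-byte (512-bit) vectors as in section 8.2; the names vmem_aligned_addr and needs_vmemu are hypothetical.

```c
#include <stdint.h>

#define HVX_VECTOR_BYTES 64u   /* 512-bit vectors, as in section 8.2 */

/* Address actually used by an aligned vmem access: low 6 bits dropped. */
uint32_t vmem_aligned_addr(uint32_t addr)
{
    return addr & ~(HVX_VECTOR_BYTES - 1u);
}

/* Non-zero when reading from addr itself would need the unaligned form. */
int needs_vmemu(uint32_t addr)
{
    return (addr & (HVX_VECTOR_BYTES - 1u)) != 0;
}
```

For instance, an aligned store to 0x1043 lands at 0x1040, so data meant for 0x1043 needs the VMEMU form instead.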
8.4 HVX instruction overview
8.4.1 Vector load/store
// The immediate increment and post increments
// values correspond to multiples of vector length.
V2 = vmem(R1+#4) // address R1 + 4 * (vector-size) bytes
V2 = vmem(R1++M1) // address R1, post-modify by the value of M1
“load-temp” and “load-current”
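1) load-temp and load-current allow loaded data to be used within the same instruction packet; the Hexagon scalar core achieves this with .new.
2) A load-temp instruction does not write the loaded data to the register file.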
// The “load-temp” and “load-current” forms allow
// immediate use of load data in the same packet.
// A “load-temp” instruction does not write the load data to the register file.
{
V2.tmp = vmem(R1+#1) // Data loaded into a tmp
V5:4.ub = vadd(V3.ub, V2.ub) // Used the loaded data as the V2 source
V7:6.uw = vrmpy(V5:4.ub, R5.ub, #0)
}
3) A load-current instruction writes the loaded data to the register file.
// load-current consumes a vector ALU resource as the
// loaded data is written to the register file
{
V2.cur = vmem(R1+#1) // Data loaded into a V2
V3 = valign(V1, V2, R4) // load data used immediately
V7:6.ub = vrmpy(V5:4.ub, R5.ub,#0)
}
Vector stores
vmem(R1+#1)= V20.new // Store V20 that was generated
// in the current packet
if (P0) vmem(R1++M1) = V20 // Store V20 if P0 is true
if (Q0) vmem(R1++M1) = V20 // Store bytes of V20 where Q0 is true
8.4.2 Histogram instruction
{
V31.tmp = VMEM(R2) // Load 64 bytes from memory
VHIST(); // Perform histogram using counters
// in VRF and indexes from temp load
}
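As a conceptual reference, the result of a histogram pass over one vector of bytes can be described with a scalar loop. This sketch only models the counting; how the counters are laid out across the vector register file is not modelled, and the name hist_vector is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Conceptual scalar equivalent: count each byte value in one vector.
 * The bin layout inside the HVX register file is not modelled. */
void hist_vector(const uint8_t *data, size_t n, uint32_t bins[256])
{
    for (size_t i = 0; i < n; i++) {
        bins[data[i]]++;   /* one counter per possible byte value */
    }
}
```

Calling it with n = 64 corresponds to the 64 bytes loaded by the temporary load in the packet above.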
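8.4.3 Instruction latency
An HVX instruction takes 2 or 4 clock cycles to execute.
A new instruction packet issues an execution request every 2 clock cycles.
The following inputs need their source data to be ready early; otherwise a delay occurs:
1) Input to the multiplier.
2) Input to Shift/Bit Count instructions.
3) Input to Permute instructions.
4) Unaligned Store Data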
For the cases above, if the source data is produced by the immediately preceding packet, a stall is incurred:
V8 = VADD(V0,V0)
V0 = VADD(V8,V9) // NO STALL
V1 = VMPY(V0,R0) // STALL due to V0
V2 = VSUB(V2,V1) // NO STALL on V1
V5:4 = VUNPACK(V2) // STALL due to V2
V2 = VADD(V0,V4) // NO STALL on V4
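The stall annotations in the sequence above can be reproduced with a small counting model. This is a sketch under simplifying assumptions: one instruction per packet, and a stall whenever an operation that needs its sources early (multiply, shift, permute) reads a register written by the immediately preceding instruction. The struct hvx_op and the function count_stalls are hypothetical names, and real hardware has more cases than this.

```c
#include <stddef.h>

/* One instruction per packet. Register numbers are plain ints;
 * -1 marks an unused slot. All names here are hypothetical. */
struct hvx_op {
    int dst[2];        /* destination register(s) */
    int src[2];        /* vector source register(s) */
    int early_source;  /* 1 for multiply, shift, permute style ops */
};

/* Count stalls: an op that needs its sources early stalls if the
 * previous instruction wrote one of those sources. */
int count_stalls(const struct hvx_op *ops, size_t n)
{
    int stalls = 0;
    for (size_t i = 1; i < n; i++) {
        if (!ops[i].early_source) {
            continue;   /* ordinary ALU ops do not stall in this model */
        }
        int hit = 0;
        for (int s = 0; s < 2; s++) {
            for (int d = 0; d < 2; d++) {
                if (ops[i].src[s] >= 0 &&
                    ops[i].src[s] == ops[i - 1].dst[d]) {
                    hit = 1;
                }
            }
        }
        stalls += hit;
    }
    return stalls;
}
```

Encoding the six instructions above in this form yields two stalls, one for the VMPY reading V0 and one for the VUNPACK reading V2.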