(Discussion) Efficiency of ARM's DSP functions

Posted 2019-07-20 21:48

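       I am currently using an F407 to rework a product that my company originally built around a DSP.       Has anyone on the forum used the ARM DSP core to evaluate algorithms? My earlier DSP evaluation flow was as follows: once an algorithm is written, first estimate how many multiply-accumulate operations and how much memory it needs, then optimize the software pipeline on the DSP. Timing is usually measured with the DSP's two internal "heartbeat counters", TSCH and TSCL, which give the actual instruction cost.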
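As a rough illustration of the counter-based timing described above, here is a minimal host-side sketch. The helper names (`tsc_join`, `elapsed_cycles32`, `cycles_to_seconds`) are hypothetical, and `clock_hz` is an assumption that has to match the real core clock (for example 168 MHz, the F407's maximum). On the Cortex-M4 side, a free-running 32-bit cycle counter such as the DWT cycle counter could supply the raw start and end values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: join the high and low 32-bit halves of a 64-bit
 * time-stamp counter (TSCH:TSCL on the DSP side). */
uint64_t tsc_join(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}

/* Elapsed ticks of a free-running 32-bit counter. Unsigned subtraction
 * stays correct across a single wraparound of the counter. */
uint32_t elapsed_cycles32(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Convert a cycle count to seconds. clock_hz is an assumed core clock
 * and must be positive. */
double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}
```

Reading the counter immediately before and after the code under test, then passing both values to `elapsed_cycles32`, gives a cycle cost that can be compared directly with the multiply-accumulate estimate.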


                re += *(in-i) * fircoeff[i];     The "sec" readout shows this takes 0.0000136 s

       I slightly modified a function from the DSP library:
       2. arm_add_f32(&IN[0], &fircoeff[0], &test[6], 6);   The "sec" readout shows this takes 0.0000187 s
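Note that the two timed operations do not compute the same thing: the loop is a multiply-accumulate over the filter taps, while `arm_add_f32` adds two vectors element by element. A minimal portable sketch of both, using hypothetical names (`fir_tap_sum`, `add_f32_ref`) and assuming that `in` points at the newest sample with older samples at lower addresses:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiply-accumulate over n taps.
 * `in` points at the newest sample; sample i steps back is *(in - i). */
float fir_tap_sum(const float *in, const float *coeff, size_t n)
{
    assert(in != NULL && coeff != NULL);
    float re = 0.0f;
    for (size_t i = 0; i < n; i++) {
        re += *(in - i) * coeff[i];
    }
    return re;
}

/* Hypothetical reference for an element-wise vector add:
 * dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32_ref(const float *a, const float *b, float *dst, size_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

The first routine produces one scalar from n multiplies and n adds; the second produces n outputs from n adds and n stores, so the two timings measure different workloads.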
    _nassert(((int)x & 7) ==0);
    _nassert(((int)y & 7) ==0);
    _nassert(nr % 8 == 0);

    #pragma MUST_ITERATE(2,4096,2);
    #pragma UNROLL(16);

                *(y+i) += m *  *(x - i);


Note: this question has been resolved and closed. Closed questions can no longer be edited or answered.
2019-07-21 14:31
Last edited by yyx112358 on 2016-5-25 17:54

[mw_shl_code=c,true]#include "arm_math.h" /* provides float32_t */

void arm_add_f32(
  float32_t * pSrcA,
  float32_t * pSrcB,
  float32_t * pDst,
  uint32_t blockSize)
{
  uint32_t blkCnt; /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

  /* Run the below code for Cortex-M4 and Cortex-M3 */
  float32_t inA1, inA2, inA3, inA4; /* temporary input variables */
  float32_t inB1, inB2, inB3, inB4; /* temporary input variables */

  /* loop unrolling */
  blkCnt = blockSize >> 2u;

  /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
  ** a second loop below computes the remaining 1 to 3 samples. */
  while(blkCnt > 0u)
  {
    /* C = A + B */
    /* Add and then store the results in the destination buffer. */
    /* read four inputs from sourceA and four inputs from sourceB */
    inA1 = *pSrcA;
    inB1 = *pSrcB;
    inA2 = *(pSrcA + 1);
    inB2 = *(pSrcB + 1);
    inA3 = *(pSrcA + 2);
    inB3 = *(pSrcB + 2);
    inA4 = *(pSrcA + 3);
    inB4 = *(pSrcB + 3);

    /* C = A + B */
    /* add and store result to destination */
    *pDst = inA1 + inB1;
    *(pDst + 1) = inA2 + inB2;
    *(pDst + 2) = inA3 + inB3;
    *(pDst + 3) = inA4 + inB4;

    /* update pointers to process next samples */
    pSrcA += 4u;
    pSrcB += 4u;
    pDst += 4u;

    /* Decrement the loop counter */
    blkCnt--;
  }

  /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
  ** No loop unrolling is used. */
  blkCnt = blockSize % 0x4u;

#else

  /* Run the below code for Cortex-M0 */
  /* Initialize blkCnt with number of samples */
  blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

  while(blkCnt > 0u)
  {
    /* C = A + B */
    /* Add and then store the results in the destination buffer. */
    *pDst++ = (*pSrcA++) + (*pSrcB++);

    /* Decrement the loop counter */
    blkCnt--;
  }
}
[/mw_shl_code]
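To see how the unrolled loop and the tail loop divide the work for a given block size, here is a minimal host-side sketch. The names `block_split_t`, `split_block` and `add_f32_unrolled` are hypothetical and plain `float` stands in for `float32_t`; it mirrors the structure of the listing above rather than reproducing the library source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: how a block is divided between the two loops. */
typedef struct {
    uint32_t unrolled_iters; /* iterations that each handle 4 samples */
    uint32_t tail_samples;   /* leftover samples handled one at a time */
} block_split_t;

block_split_t split_block(uint32_t blockSize)
{
    block_split_t s;
    s.unrolled_iters = blockSize >> 2u; /* blockSize / 4 */
    s.tail_samples = blockSize % 4u;    /* 0 to 3 leftovers */
    return s;
}

/* Hypothetical portable add with the same two-loop structure. */
void add_f32_unrolled(const float *a, const float *b, float *dst,
                      uint32_t blockSize)
{
    assert(a != NULL && b != NULL && dst != NULL);
    block_split_t s = split_block(blockSize);

    while (s.unrolled_iters > 0u) {
        dst[0] = a[0] + b[0];
        dst[1] = a[1] + b[1];
        dst[2] = a[2] + b[2];
        dst[3] = a[3] + b[3];
        a += 4;
        b += 4;
        dst += 4;
        s.unrolled_iters--;
    }
    while (s.tail_samples > 0u) {
        *dst++ = *a++ + *b++;
        s.tail_samples--;
    }
}
```

With a block size of 6, as in the call from the question, the unrolled loop runs once and the tail loop handles the remaining 2 samples, so loop setup and bookkeeping make up a large share of the total time for such a short block.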
