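(Discussion) Efficiency of ARM's DSP functions

Posted 2019-07-20 21:48

This question has been closed by the author or an administrator; no new replies can be added.

15 replies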

taizonglai

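#1 · 2019-07-21 00:26

I haven't used the F4's DSP features, but my guess is that the compiler optimizes the first calculation method, which is why it is faster. You could run a test with floating-point multiplication.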
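A minimal host-side sketch of the kind of floating-point multiplication test suggested here, assuming a plain loop on one side and a manually 4x-unrolled loop on the other. The names `mul_plain`, `mul_unrolled4`, and `time_mul` are hypothetical and not part of any ARM library, and `clock()` is only a coarse stand-in: on an actual Cortex-M4 target a hardware cycle counter would give more meaningful numbers, and the result will depend heavily on the compiler's optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Plain element-wise multiply, left entirely to the compiler to optimize. */
void mul_plain(const float *a, const float *b, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] * b[i];
    }
}

/* Same result, manually unrolled four samples at a time. */
void mul_unrolled4(const float *a, const float *b, float *dst, size_t n)
{
    size_t blocks = n >> 2; /* full groups of four */
    size_t tail = n & 3u;   /* 0 to 3 leftover samples */

    while (blocks > 0) {
        dst[0] = a[0] * b[0];
        dst[1] = a[1] * b[1];
        dst[2] = a[2] * b[2];
        dst[3] = a[3] * b[3];
        a += 4;
        b += 4;
        dst += 4;
        blocks--;
    }
    while (tail > 0) {
        *dst++ = (*a++) * (*b++);
        tail--;
    }
}

/* Hypothetical function-pointer type so both variants share one timer. */
typedef void (*mul_fn)(const float *, const float *, float *, size_t);

/* CPU time in seconds for `repeats` calls of `fn` (coarse, host-only). */
double time_mul(mul_fn fn, const float *a, const float *b, float *dst,
                size_t n, unsigned repeats)
{
    assert(fn != NULL && repeats > 0);
    clock_t start = clock();
    for (unsigned r = 0; r < repeats; r++) {
        fn(a, b, dst, n);
    }
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC;
}
```

Calling `time_mul(mul_plain, ...)` and `time_mul(mul_unrolled4, ...)` over the same buffers, at several buffer sizes and optimization levels, gives a rough comparison. An optimizing compiler may unroll the plain loop on its own, which would be consistent with the guess in this reply.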

czdspeed

#2 · 2019-07-21 03:36

[Reply hidden by the forum as a paid answer (2 yuan to view)]

czdspeed

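#3 · 2019-07-21 06:53

zuozhongkai posted on 2016-5-23 14:57
I haven't used the F4's DSP features, but my guess is that the compiler optimizes the first calculation method, which is why it is faster. You could run a test with floating-point multiplication.

For now I still need to rework this on ARM. Once I have finished testing, I will write up the differences between ARM's DSP core and a dedicated DSP.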

jeff_梁

#4 · 2019-07-21 09:08

[Reply hidden by the forum as a paid answer (2 yuan to view)]

yyx112358

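#5 · 2019-07-21 14:31

Last edited by yyx112358 on 2016-5-25 17:54

DSP library functions usually do a fair amount of setup before the actual computation, mainly for 4-byte alignment, so their advantage only shows when the number of samples is large.
That said, the arm_add_f32 library function probably does not offer any real performance optimization; it is mainly there for convenience. Here is the source: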
[mw_shl_code=c,true]#include "arm_math.h"

void arm_add_f32(
  float32_t * pSrcA,
  float32_t * pSrcB,
  float32_t * pDst,
  uint32_t blockSize)
{
  uint32_t blkCnt;                      /* loop counter */

#ifndef ARM_MATH_CM0_FAMILY

  /* Run the below code for Cortex-M4 and Cortex-M3 */
  float32_t inA1, inA2, inA3, inA4;     /* temporary input variables */
  float32_t inB1, inB2, inB3, inB4;     /* temporary input variables */

  /* Loop unrolling */
  blkCnt = blockSize >> 2u;

  /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
   ** A second loop below computes the remaining 1 to 3 samples. */
  while(blkCnt > 0u)
  {
    /* C = A + B */
    /* Add and then store the results in the destination buffer. */

    /* Read four inputs from sourceA and four inputs from sourceB */
    inA1 = *pSrcA;
    inB1 = *pSrcB;
    inA2 = *(pSrcA + 1);
    inB2 = *(pSrcB + 1);
    inA3 = *(pSrcA + 2);
    inB3 = *(pSrcB + 2);
    inA4 = *(pSrcA + 3);
    inB4 = *(pSrcB + 3);

    /* C = A + B */
    /* Add and store result to destination */
    *pDst = inA1 + inB1;
    *(pDst + 1) = inA2 + inB2;
    *(pDst + 2) = inA3 + inB3;
    *(pDst + 3) = inA4 + inB4;

    /* Update pointers to process next samples */
    pSrcA += 4u;
    pSrcB += 4u;
    pDst += 4u;

    /* Decrement the loop counter */
    blkCnt--;
  }

  /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
   ** No loop unrolling is used. */
  blkCnt = blockSize % 0x4u;

#else

  /* Run the below code for Cortex-M0 */

  /* Initialize blkCnt with number of samples */
  blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

  while(blkCnt > 0u)
  {
    /* C = A + B */
    /* Add and then store the results in the destination buffer. */
    *pDst++ = (*pSrcA++) + (*pSrcB++);

    /* Decrement the loop counter */
    blkCnt--;
  }
}[/mw_shl_code]
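To make the structure of the listing above easier to see, here is a small host-side sketch of how `blockSize` is divided between the 4-at-a-time loop and the one-at-a-time tail loop. It assumes plain `float` in place of `float32_t` so it compiles without CMSIS headers. The names `block_split_t`, `split_block`, `add_f32_plain`, and `add_f32_unrolled` are hypothetical helpers for illustration, not part of CMSIS-DSP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: how the two loops divide the work. */
typedef struct {
    uint32_t unrolled_iters; /* passes through the 4-at-a-time loop */
    uint32_t tail_samples;   /* samples left for the one-at-a-time loop */
} block_split_t;

/* Same arithmetic as the listing: blockSize >> 2 and blockSize % 4. */
block_split_t split_block(uint32_t blockSize)
{
    block_split_t s;
    s.unrolled_iters = blockSize >> 2u;
    s.tail_samples = blockSize % 0x4u;
    return s;
}

/* Plain reference loop, used only to cross-check the unrolled version. */
void add_f32_plain(const float *pSrcA, const float *pSrcB, float *pDst,
                   uint32_t blockSize)
{
    for (uint32_t i = 0; i < blockSize; i++) {
        pDst[i] = pSrcA[i] + pSrcB[i];
    }
}

/* Host copy of the unrolled structure (float instead of float32_t). */
void add_f32_unrolled(const float *pSrcA, const float *pSrcB, float *pDst,
                      uint32_t blockSize)
{
    block_split_t s = split_block(blockSize);

    /* Every sample is handled by exactly one of the two loops. */
    assert(s.unrolled_iters * 4u + s.tail_samples == blockSize);

    while (s.unrolled_iters > 0u) {
        pDst[0] = pSrcA[0] + pSrcB[0];
        pDst[1] = pSrcA[1] + pSrcB[1];
        pDst[2] = pSrcA[2] + pSrcB[2];
        pDst[3] = pSrcA[3] + pSrcB[3];
        pSrcA += 4;
        pSrcB += 4;
        pDst += 4;
        s.unrolled_iters--;
    }
    while (s.tail_samples > 0u) {
        *pDst++ = (*pSrcA++) + (*pSrcB++);
        s.tail_samples--;
    }
}
```

For a `blockSize` below 4, `split_block` reports zero unrolled iterations, so only the tail loop runs and the unrolling contributes nothing. Under these assumptions that matches the observation that any benefit only appears for larger sample counts.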

xkwy

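#6 · 2019-07-21 20:15

I haven't used it, but I would expect ARM's own DSP library to be the most efficient option; nobody understands the core better than ARM does.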

http://www.keil.com/pack/doc/CMSIS/DSP/html/index.html
