
X264 Performance Optimization

Published 2019-07-13 18:40

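I. X264 Performance Analysis

Test environment
CPU: Intel Pentium 4 3.00 GHz (dual-core), Hyper-Threading enabled
Memory: DDR 1.00 GB
Operating system: Windows Server 2003 Enterprise Edition
Profiler: Intel(R) VTune(TM) Performance Analyzer 8.0 (evaluation license)
Build tools: VC71 + nasm 0.98
Bus speed: 800 MHz
Test program: X264 20060506 encoder

1. Debug version
Encoding parameters:
X264 -fps -o foreman.264 forman.cif 352x288
Encoding 400 frames ran at about 23 fps with the libx264 debug build and 35 fps with the libx264 release build, a gain of more than 10 fps, which is substantial.

2. With --no-asm
Encoding parameters:
X264 -fps --no-asm -o foreman.264 forman.cif 352x288
--no-asm disables all CPU optimizations, that is, none of the MMX, MMXEXT, SSE, SSE2, 3DNow!, 3DNow! Ext, or AltiVec assembly optimizations are used.
Encoding 400 frames ran at 2.67 fps with the libx264 debug build and 12.67 fps with the libx264 release build, a gain of 10 fps.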
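The absolute gains above are easier to compare as ratios. Here is a minimal C sketch that turns the reported frame rates into speedup factors; the helper name `speedup` and the constant names are hypothetical, and the numbers are simply the ones quoted above (the 23 fps figure is approximate).

```c
#include <assert.h>

/* Hypothetical helper: how many times faster fps_fast is than fps_slow. */
static double speedup(double fps_fast, double fps_slow)
{
    assert(fps_slow > 0.0);
    return fps_fast / fps_slow;
}

/* Frame rates reported in this post for 400 frames of the CIF test clip. */
static const double FPS_ASM_DEBUG     = 23.0;   /* "about 23 fps" */
static const double FPS_ASM_RELEASE   = 35.0;
static const double FPS_NOASM_DEBUG   = 2.67;
static const double FPS_NOASM_RELEASE = 12.67;
```

With these figures, the release build is about 1.5x faster than the debug build when assembly is enabled (`speedup(FPS_ASM_RELEASE, FPS_ASM_DEBUG)`) and about 4.7x faster with --no-asm. Looking at it the other way, the assembly optimizations are worth roughly 8.6x in the debug build and roughly 2.8x in the release build.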
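Clockticks per Instructions Retired (CPI) is the average number of clock cycles needed to execute one instruction in a given piece of code. A higher CPI indicates that the code performs more expensive operations, such as floating-point arithmetic, multiplication, division, I/O, system calls, or file access.
Instructions Retired events is the number of instructions executed; a larger value means the module executes more instructions.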
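As a small illustration of the metric, CPI is just the clocktick count divided by the instructions-retired count. The sketch below assumes a hypothetical `profile_row` struct and a hypothetical convention of returning a negative value when no instructions were retired (the profiler table in this post leaves CPI blank in that case). It is not part of VTune or x264.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record holding the two raw event counts for one function. */
struct profile_row {
    uint64_t instructions_retired;
    uint64_t clockticks;
};

/* CPI = clockticks / instructions retired.
   Returns -1.0 when no instructions were retired, since CPI is undefined then. */
static double cpi(const struct profile_row *r)
{
    assert(r != NULL);
    if (r->instructions_retired == 0)
        return -1.0;
    return (double)r->clockticks / (double)r->instructions_retired;
}
```

For example, the refine_subpel row of the table reports 1119000000 instructions retired and 3414000000 clockticks, which gives a CPI of about 3.05.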
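Clockticks events is the number of clock cycles the module consumes. In general, Clockticks events = Instructions Retired events * Clockticks per Instructions Retired (CPI); the larger the value, the more time the module takes. The Clockticks % column gives the module's share of the total run time of the program.

One point needs care here, and an example shows it best. Suppose we want the time cost of the deblocking filter in the video encoder. The time spent in x264_frame_deblocking_filter alone is not the whole cost of deblocking in x264, because x264_frame_deblocking_filter calls deblock_edge and x264_clip3 (which is also called by other functions), and deblock_edge in turn calls x264_deblock_v8_luma_mmxext, x264_deblock_h_luma_mmxext, x264_deblock_h_chroma_mmxext, deblock_luma_intra_c, and x264_deblock_v_chroma_mmxext. (These functions are bound through function pointers so that the code adapts to different hardware platforms; Intel and AMD CPUs, for example, support different instruction sets. MPlayer, FFmpeg, T264, and similar software use the same kind of rebinding so that one program can run on different architectures and platforms such as ARM, PowerPC, and x86.) To get the deblocking filter's share of the run time, the function and all of the functions it calls must be counted. The x264_deblock_* functions are called only from deblock_edge, so they count in full. x264_clip3 is called from outside the deblocking filter as well, so only part of its time belongs to deblocking; how large that part is depends on how often each caller invokes it, which is not determined here.

The x264 timing data is in the table later in this post. Deblocking takes about 4.3%, quant + dequant about 3.3%, and DCT + IDCT about 1.1%. Most of the time goes to motion estimation and motion compensation: the many SAD/SATD computations in ME and the six-tap filter (tap_filter) in MC are the main costs. I did not tally these precisely, but they come to nearly 20%. Because x264 combines algorithmic optimization, code optimization, and MMX, SSE, and SSE2 instruction-level optimization, parts that used to be expensive, such as the deblocking filter, have been sped up considerably.

Next, a word on program performance optimization techniques, which can be considered in roughly three parts.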
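Before turning to those three parts, the function-pointer rebinding mentioned above can be sketched in C. This is a simplified illustration and not x264's actual code: the type `smooth_fn`, the struct `filter_functions`, the flag `CPU_HAS_SIMD`, the function `filter_init`, and both filter bodies are hypothetical stand-ins. The "fast" version is written in plain C (unrolled by two) so that the example compiles anywhere; in a real codec it would be MMX/SSE assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature shared by every implementation of the filter. */
typedef void (*smooth_fn)(uint8_t *pix, int n);

/* Portable reference version: average each pixel with its right neighbour. */
static void smooth_c(uint8_t *pix, int n)
{
    for (int i = 0; i + 1 < n; i++)
        pix[i] = (uint8_t)((pix[i] + pix[i + 1] + 1) >> 1);
}

/* Stand-in for a platform-specific version; same result, loop unrolled by two. */
static void smooth_fast(uint8_t *pix, int n)
{
    int i = 0;
    for (; i + 2 < n; i += 2) {
        pix[i]     = (uint8_t)((pix[i]     + pix[i + 1] + 1) >> 1);
        pix[i + 1] = (uint8_t)((pix[i + 1] + pix[i + 2] + 1) >> 1);
    }
    for (; i + 1 < n; i++)  /* leftover element when the count is odd */
        pix[i] = (uint8_t)((pix[i] + pix[i + 1] + 1) >> 1);
}

#define CPU_HAS_SIMD 0x1u  /* hypothetical capability flag */

struct filter_functions {
    smooth_fn smooth;
};

/* Bind the pointer once at start-up; callers only ever use pf->smooth. */
static void filter_init(unsigned cpu_flags, struct filter_functions *pf)
{
    assert(pf != NULL);
    pf->smooth = smooth_c;
    if (cpu_flags & CPU_HAS_SIMD)
        pf->smooth = smooth_fast;
}
```

The calling code stays identical on every platform (`pf.smooth(row, width)`); only the initialization decides which implementation runs. This is also why a profiler lists names such as x264_deblock_v8_luma_mmxext separately from the C function that dispatches to them.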
1. Algorithm and structure optimization. The same functionality can be implemented with many different algorithms and methods. In H.264 motion estimation, for example, full search and fast motion estimation achieve essentially the same coding efficiency, but the fast algorithms can cut processing time by a factor of 10 to 20, so choosing an efficient algorithm matters. Another case is converting recursion to iteration: recursive algorithms give a clear, readable program structure, but they require many procedure calls and stack saves and run inefficiently.

2. Compiler optimization. Many compilers now implement fairly strong code optimization. Most build on data-flow analysis to perform alias analysis (eliminating data dependences through variable renaming to improve pipeline efficiency), constant folding, common subexpression elimination, redundant code elimination, loop inversion, loop unrolling, and other architecture-independent optimizations; GNU gcc is a good example of such a tool. Compilers also borrow from parallel programming: they run dependence analysis and apply transformations that give the program better locality and so raise the cache hit rate. In GCC, options such as -O, -O2, -O3, and -O4 select optimization for speed, size, and so on (note that GCC treats levels above -O3 the same as -O3, and that -Os is the option that targets code size). In addition, a debug build carries a good deal of extra debug information that hurts efficiency, as the debug versus release comparison for x264 above shows. Compiler optimization is static and is done automatically by the compiler, but the compiler can hardly obtain the program's semantic information, algorithm flow, and so on. We therefore need to optimize the code by hand to get the most out of it.

3. Program optimization, which includes:
a) Use inline functions. Many compilers support the inline keyword, which reduces function call overhead at the cost of larger code.
b) Target the platform the program runs on. For architectures such as x86 (Intel), XScale, ARM, and DSP, rewrite the most time-consuming parts and hot loops with the matching assembly optimizations (MMX, SSE, Wireless MMX, ARM/Thumb instructions, DSP assembly, and so on), or use dedicated libraries; for example, embedded systems based on Intel CPUs or the XScale architecture (PXA255, PXA270, and so on) can use the IPP/GPP libraries to improve efficiency.
c) On DSP systems, which have several parallel processing units, the compiler schedules code in parallel. Avoid frequent jumps in small loops and unroll such loops; reducing the number of loops, inner loops in particular, also improves pipeline efficiency and helps keep the pipeline from stalling.
d) In a switch statement, order the case labels by how often they occur. A compiler may translate a switch into nested if-else-if code (dense case values are often compiled to a jump table instead), so ordering the cases by probability can improve efficiency. (In FPGA/CPLD logic devices, a switch-style statement described in VHDL is synthesized into multiple logic elements that run fully in parallel.)
e) Reduce the number of function call parameters.
f) Reduce expensive floating-point operations, divisions, and similar operations to lower the CPI.

Size | Function | Clockticks per Instructions Retired (CPI) | Instructions Retired events | Clockticks events | Clockticks % | Source File
4917 | refine_subpel | 3.050938338 | 1119000000 | 3414000000 | 6.582219909 | f:\x264-060506\x264-060506\encoder\me.c
176 | x264_mc_chroma_mmxext | 1.463709677 | 2232000000 | 3267000000 | 6.298802707 |
21502 | x264_me_search_ref | 2.515923567 | 942000000 | 2370000000 | 4.569379374 | f:\x264-060506\x264-060506\encoder\me.c
880 | x264_pixel_satd_8x8_sse2 | 1.43551797 | 1419000000 | 2037000000 | 3.927352652 |
99 | RTC_CheckStackVars | 3.563157895 | 570000000 | 2031000000 | 3.915784603 |
3296 | x264_pixel_satd_16x16_sse2 | 1.54047619 | 1260000000 | 1941000000 | 3.742263867 |
237 | get_ref_mmx | 1.725925926 | 810000000 | 1398000000 | 2.695355428 | f:\x264-060506\x264-060506\common\i386\mc-c.c
1183 | block_residual_write_cabac | 3.15862069 | 435000000 | 1374000000 | 2.649083232 | f:\x264-060506\x264-060506\encoder\cabac.c
6480 | x264_macroblock_analyse | 24.05555556 | 54000000 | 1299000000 | 2.504482619 | f:\x264-060506\x264-060506\encoder\analyse.c
272 | x264_pixel_satd_4x4_mmxext | 1.229850746 | 1005000000 | 1236000000 | 2.383018104 |
80 | x264_pixel_avg_w16_mmxext | 2.096045198 | 531000000 | 1113000000 | 2.145873099 |
232 | x264_mb_decimate_score | 1.354085603 | 771000000 | 1044000000 | 2.012840534 | f:\x264-060506\x264-060506\encoder\macroblock.c
64 | x264_pixel_avg_w8_mmxext | 1.756906077 | 543000000 | 954000000 | 1.839319799 |
2413 | x264_frame_deblocking_filter | 1.703910615 | 537000000 | 915000000 | 1.76412748 | f:\x264-060506\x264-060506\common\frame.c
2491 | x264_macroblock_cache_save | 2.152173913 | 414000000 | 891000000 | 1.717855284 | f:\x264-060506\x264-060506\common\macroblock.c
656 | x264_center_filter_mmxext | 1.211864407 | 708000000 | 858000000 | 1.654231014 |
146 | quant_4x4 | 2.989247312 | 279000000 | 834000000 | 1.607958818 | f:\x264-060506\x264-060506\encoder\macroblock.c
5930 | x264_macroblock_cache_load | 2.090225564 | 399000000 | 834000000 | 1.607958818 | f:\x264-060506\x264-060506\common\macroblock.c
206 | x264_cabac_encode_renorm | 2.125984252 | 381000000 | 810000000 | 1.561686622 | f:\x264-060506\x264-060506\common\cabac.c
83 | array_non_zero_count | 1.191964286 | 672000000 | 801000000 | 1.544334548 | f:\x264-060506\x264-060506\encoder\macroblock.h
96 | memset | 9.464285714 | 84000000 | 795000000 | 1.532766499 | F:\VS70Builds\3077\vc\crtbld\crt\src\intel\memset.asm
363 | predict_16x16_p | 1.095435685 | 723000000 | 792000000 | 1.526982474 | f:\x264-060506\x264-060506\common\predict.c
184 | x264_cabac_encode_decision | 2.371428571 | 315000000 | 747000000 | 1.440222107 | f:\x264-060506\x264-060506\common\cabac.c
37 | _RTC_CheckEsp | 1.707142857 | 420000000 | 717000000 | 1.382381861 |
3693 | x264_macroblock_encode | 2.890243902 | 246000000 | 711000000 | 1.370813812 | f:\x264-060506\x264-060506\encoder\macroblock.c
47 | x264_clip_uint8 | 1.317365269 | 501000000 | 660000000 | 1.272485395 | f:\x264-060506\x264-060506\common\clip1.h
304 | x264_quant_4x4_core15_mmx | 1.674796748 | 369000000 | 618000000 | 1.191509052 |
2091 | x264_mb_analyse_intra | 1.844036697 | 327000000 | 603000000 | 1.162588929 | f:\x264-060506\x264-060506\encoder\analyse.c
1680 | x264_pixel_satd_8x16_sse2 | 1.144508671 | 519000000 | 594000000 | 1.145236856 |
1696 | x264_pixel_satd_16x8_sse2 | 1.449612403 | 387000000 | 561000000 | 1.081612586 |
164 | motion_compensation_chroma_mmxext | 1.459677419 | 372000000 | 543000000 | 1.046908439 | f:\x264-060506\x264-060506\common\mc.c
328 | deblock_edge | 1.594059406 | 303000000 | 483000000 | 0.931227948 | f:\x264-060506\x264-060506\common\frame.c
363 | predict_8x8c_p | 1.453703704 | 324000000 | 471000000 | 0.90809185 | f:\x264-060506\x264-060506\common\predict.c
176 | x264_macroblock_cache_mv | 1.662650602 | 249000000 | 414000000 | 0.798195384 | f:\x264-060506\x264-060506\common\macroblock.h
71 | x264_clip3 | 1.666666667 | 216000000 | 360000000 | 0.694082943 | f:\x264-060506\x264-060506\common\common.h
121 | x264_macroblock_cache_ref | 2.333333333 | 153000000 | 357000000 | 0.688298918 | f:\x264-060506\x264-060506\common\macroblock.h
272 | x264_horizontal_filter_mmxext | 1.216494845 | 291000000 | 354000000 | 0.682514894 |
1104 | x264_pixel_sad_x4_16x16_sse2 | 4.423076923 | 78000000 | 345000000 | 0.66516282 |
480 | x264_pixel_satd_8x4_sse2 | 1.430379747 | 237000000 | 339000000 | 0.653594771 |
496 | x264_deblock_v8_luma_mmxext | 1.066666667 | 315000000 | 336000000 | 0.647810747 |
432 | x264_pixel_sad_x4_8x8_mmxext | 1.671641791 | 201000000 | 336000000 | 0.647810747 |
288 | x264_pixel_sad_16x16_sse2 | 4.608695652 | 69000000 | 318000000 | 0.6131066 |
910 | x264_mb_predict_mv | 2.363636364 | 132000000 | 312000000 | 0.601538551 | f:\x264-060506\x264-060506\common\macroblock.c
106 | bs_write1 | 2.666666667 | 117000000 | 312000000 | 0.601538551 | f:\x264-060506\x264-060506\common\bs.h
224 | x264_sub4x4_dct_mmx | 1.16091954 | 261000000 | 303000000 | 0.584186477 |
211 | scan_zigzag_4x4full | 1.672727273 | 165000000 | 276000000 | 0.532130256 | f:\x264-060506\x264-060506\encoder\macroblock.c
656 | x264_deblock_h_luma_mmxext | 3.214285714 | 84000000 | 270000000 | 0.520562207 |
227 | predict_16x16_dc | 2.378378378 | 111000000 | 264000000 | 0.508994158 | f:\x264-060506\x264-060506\common\predict.c
496 | x264_pixel_satd_4x8_mmxext | 1.242857143 | 210000000 | 261000000 | 0.503210134 |
960 | x264_pixel_ssd_16x16_sse2 | 4.315789474 | 57000000 | 246000000 | 0.474290011 |
33 | abs | 1.860465116 | 129000000 | 240000000 | 0.462721962 | f:\vs70builds\3077\vc\crtbld\crt\src\abs.c
864 | x264_pixel_sad_x3_16x16_sse2 | 3.391304348 | 69000000 | 234000000 | 0.451153913 |
962 | x264_mb_analyse_inter_p8x8 | 1.948717949 | 117000000 | 228000000 | 0.439585864 | f:\x264-060506\x264-060506\encoder\analyse.c
3064 | x264_macroblock_write_cabac | 2.62962963 | 81000000 | 213000000 | 0.410665741 | f:\x264-060506\x264-060506\encoder\cabac.c
1209 | x264_mb_encode_8x8_chroma | 2.379310345 | 87000000 | 207000000 | 0.399097692 | f:\x264-060506\x264-060506\encoder\macroblock.c
829 | memcpy | 11 | 18000000 | 198000000 | 0.381745619 | F:\VS70Builds\3077\vc\crtbld\crt\src\intel\memcpy.asm
386 | predict_8x8c_dc | 2.52 | 75000000 | 189000000 | 0.364393545 | f:\x264-060506\x264-060506\common\predict.c
202 | bs_write | 1.909090909 | 99000000 | 189000000 | 0.364393545 | f:\x264-060506\x264-060506\common\bs.h
352 | x264_pixel_sad_x3_8x8_mmxext | 2.172413793 | 87000000 | 189000000 | 0.364393545 |
144 | x264_pixel_sad_8x8_mmxext | 2.384615385 | 78000000 | 186000000 | 0.358609521 |
156 | predict_16x16_h | 2 | 90000000 | 180000000 | 0.347041471 | f:\x264-060506\x264-060506\common\predict.c
178 | predict_16x16_v | 2.52173913 | 69000000 | 174000000 | 0.335473422 | f:\x264-060506\x264-060506\common\predict.c
128 | x264_mc_copy_w16_mmx | 9.666666667 | 18000000 | 174000000 | 0.335473422 |
405 | x264_cabac_mb_mvd_cpn | 2.192307692 | 78000000 | 171000000 | 0.329689398 | f:\x264-060506\x264-060506\encoder\cabac.c
161 | x264_cabac_putbit | 1.4 | 120000000 | 168000000 | 0.323905373 | f:\x264-060506\x264-060506\common\cabac.c
304 | x264_dequant_4x4_mmx | 2.545454545 | 66000000 | 168000000 | 0.323905373 |
592 | x264_pixel_sad_x4_16x8_sse2 | 2.291666667 | 72000000 | 165000000 | 0.318121349 |
103 | x264_median | 2.6 | 60000000 | 156000000 | 0.300769275 | f:\x264-060506\x264-060506\common\common.h
398 | predict_4x4_ddl | 1.5625 | 96000000 | 150000000 | 0.289201226 | f:\x264-060506\x264-060506\common\predict.c
272 | x264_add4x4_idct_mmx | 1.139534884 | 129000000 | 147000000 | 0.283417202 |
418 | x264_cabac_mb_cbp_luma | 2.666666667 | 54000000 | 144000000 | 0.277633177 | f:\x264-060506\x264-060506\encoder\cabac.c
414 | predict_4x4_ddr | 2.285714286 | 63000000 | 144000000 | 0.277633177 | f:\x264-060506\x264-060506\common\predict.c
405 | predict_4x4_vl | 1.777777778 | 81000000 | 144000000 | 0.277633177 | f:\x264-060506\x264-060506\common\predict.c
1455 | x264_mb_predict_mv_ref16x16 | 3.692307692 | 39000000 | 144000000 | 0.277633177 | f:\x264-060506\x264-060506\common\macroblock.c
1181 | x264_mb_analyse_inter_p16x16 | 4.6 | 30000000 | 138000000 | 0.266065128 | f:\x264-060506\x264-060506\encoder\analyse.c
176 | x264_macroblock_cache_mvd | 1.769230769 | 78000000 | 138000000 | 0.266065128 | f:\x264-060506\x264-060506\common\macroblock.h
816 | x264_pixel_sad_x4_8x16_mmxext | 1.769230769 | 78000000 | 138000000 | 0.266065128 |
199 | scan_zigzag_4x4 | 2.045454545 | 66000000 | 135000000 | 0.260281104 | f:\x264-060506\x264-060506\encoder\macroblock.c
446 | predict_4x4_mode_available | 2.25 | 60000000 | 135000000 | 0.260281104 | f:\x264-060506\x264-060506\encoder\analyse.c
1148 | x264_mb_analyse_inter_p16x8 | 3.142857143 | 42000000 | 132000000 | 0.254497079 | f:\x264-060506\x264-060506\encoder\analyse.c
1746 | x264_mb_analyse_init | 8.2 | 15000000 | 123000000 | 0.237145005 | f:\x264-060506\x264-060506\encoder\analyse.c
511 | x264_mb_analyse_intra_chroma | 2.733333333 | 45000000 | 123000000 | 0.237145005 | f:\x264-060506\x264-060506\encoder\analyse.c
425 | predict_4x4_hd | 1.28125 | 96000000 | 123000000 | 0.237145005 | f:\x264-060506\x264-060506\common\predict.c
425 | predict_4x4_vr | 1.413793103 | 87000000 | 123000000 | 0.237145005 | f:\x264-060506\x264-060506\common\predict.c
122 | predict_8x8c_h | 1.952380952 | 63000000 | 123000000 | 0.237145005 | f:\x264-060506\x264-060506\common\predict.c
425 | x264_mb_encode_i4x4 | 2.105263158 | 57000000 | 120000000 | 0.231360981 | f:\x264-060506\x264-060506\encoder\macroblock.c
464 | x264_pixel_sad_x3_16x8_sse2 | 5 | 24000000 | 120000000 | 0.231360981 |
672 | x264_pixel_sad_x3_8x16_mmxext | 2.666666667 | 45000000 | 120000000 | 0.231360981 |
297 | predict_4x4_hu | 1.772727273 | 66000000 | 117000000 | 0.225576956 | f:\x264-060506\x264-060506\common\predict.c
120 | predict_8x8c_v | 3.083333333 | 36000000 | 111000000 | 0.214008907 | f:\x264-060506\x264-060506\common\predict.c
464 | x264_deblock_h_chroma_mmxext | 1.166666667 | 90000000 | 105000000 | 0.202440858 |
240 | x264_pixel_sad_8x16_mmxext | 1.888888889 | 54000000 | 102000000 | 0.196656834 |
1104 | x264_mb_analyse_inter_p8x16 | 3 | 33000000 | 99000000 | 0.190872809 | f:\x264-060506\x264-060506\encoder\analyse.c
176 | x264_pixel_sad_16x8_sse2 | 3.666666667 | 27000000 | 99000000 | 0.190872809 |
194 | x264_cabac_encode_bypass | 1.192307692 | 78000000 | 93000000 | 0.17930476 | f:\x264-060506\x264-060506\common\cabac.c
836 | x264_cabac_mb_cbf_ctxidxinc | 1.875 | 48000000 | 90000000 | 0.173520736 | f:\x264-060506\x264-060506\encoder\cabac.c
80 | x264_mc_copy_w8_mmx | 3 | 30000000 | 90000000 | 0.173520736 |
1385 | x264_slice_write | 4.833333333 | 18000000 | 87000000 | 0.167736711 | f:\x264-060506\x264-060506\encoder\encoder.c
680 | deblock_luma_intra_c | 2.153846154 | 39000000 | 84000000 | 0.161952687 | f:\x264-060506\x264-060506\common\frame.c
503 | x264_mb_mc_0xywh | 1.857142857 | 42000000 | 78000000 | 0.150384638 | f:\x264-060506\x264-060506\common\macroblock.c
134 | predict_4x4_dc | 5 | 15000000 | 75000000 | 0.144600613 | f:\x264-060506\x264-060506\common\predict.c
577 | x264_mb_predict_mv_16x16 | 2.5 | 30000000 | 75000000 | 0.144600613 | f:\x264-060506\x264-060506\common\macroblock.c
324 | plane_expand_border | 6.25 | 12000000 | 75000000 | 0.144600613 | f:\x264-060506\x264-060506\common\frame.c
272 | x264_deblock_v_chroma_mmxext | 1.642857143 | 42000000 | 69000000 | 0.133032564 |
123 | x264_sub8x8_dct_mmx | 1.111111111 | 54000000 | 60000000 | 0.11568049 | f:\x264-060506\x264-060506\common\i386\dct-c.c
1359 | x264_macroblock_probe_skip | 2.714285714 | 21000000 | 57000000 | 0.109896466 | f:\x264-060506\x264-060506\encoder\macroblock.c
305 | x264_cabac_mb_mvd | 3.4 | 15000000 | 51000000 | 0.098328417 | f:\x264-060506\x264-060506\encoder\cabac.c
1880 | x264_analyse_update_cache | 4 | 12000000 | 48000000 | 0.092544392 | f:\x264-060506\x264-060506\encoder\analyse.c
64 | array_non_zero | 4.666666667 | 9000000 | 42000000 | 0.080976343 | f:\x264-060506\x264-060506\encoder\macroblock.h
271 | x264_mb_dequant_2x2_dc | 3.5 | 12000000 | 42000000 | 0.080976343 | f:\x264-060506\x264-060506\common\quant.c
266 | mc_luma_mmx | 2.166666667 | 18000000 | 39000000 | 0.075192319 | f:\x264-060506\x264-060506\common\i386\mc-c.c
199 | dct2x2dc | 3.25 | 12000000 | 39000000 | 0.075192319 | f:\x264-060506\x264-060506\common\dct.c
149 | quant_2x2_dc | 1 | 33000000 | 33000000 | 0.06362427 | f:\x264-060506\x264-060506\encoder\macroblock.c
320 | x264_cabac_mb_cbp_chroma | 2.75 | 12000000 | 33000000 | 0.06362427 | f:\x264-060506\x264-060506\encoder\cabac.c
61 | _alloca_probe | | 0 | 33000000 | 0.06362427 | F:\VS70Builds\3077\vc\crtbld\crt\src\intel\chkstk.asm
38 | x264_me_search | 2.5 | 12000000 | 30000000 | 0.057840245 | f:\x264-060506\x264-060506\encoder\analyse.c
145 | x264_mb_predict_intra4x4_mode | 1 | 30000000 | 30000000 | 0.057840245 | f:\x264-060506\x264-060506\common\macroblock.c
194 | x264_mb_predict_mv_pskip | | 0 | 30000000 | 0.057840245 | f:\x264-060506\x264-060506\common\macroblock.c
279 | x264_nal_encode | 2 | 15000000 | 30000000 | 0.057840245 |
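The inclusive-time bookkeeping described earlier for the deblocking filter can be checked against this table. The minimal C sketch below sums the Clockticks % of x264_frame_deblocking_filter, deblock_edge, and the functions that deblock_edge calls. x264_clip3 is deliberately left out, because its share cannot be attributed without call counts. The struct `pct_row`, the array `deblock_rows`, and the function `inclusive_pct` are hypothetical names; the percentages are copied from the table and rounded to three decimals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one profiler row reduced to name and Clockticks %. */
struct pct_row {
    const char *function;
    double clockticks_pct;
};

/* Rows that belong entirely to the deblocking path, taken from the table. */
static const struct pct_row deblock_rows[] = {
    { "x264_frame_deblocking_filter", 1.764 },
    { "deblock_edge",                 0.931 },
    { "x264_deblock_v8_luma_mmxext",  0.648 },
    { "x264_deblock_h_luma_mmxext",   0.521 },
    { "x264_deblock_h_chroma_mmxext", 0.202 },
    { "deblock_luma_intra_c",         0.162 },
    { "x264_deblock_v_chroma_mmxext", 0.133 },
};

/* Inclusive share = the function's own share plus the shares of its callees. */
static double inclusive_pct(const struct pct_row *rows, size_t n)
{
    double sum = 0.0;
    assert(rows != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += rows[i].clockticks_pct;
    return sum;
}
```

Calling `inclusive_pct(deblock_rows, sizeof deblock_rows / sizeof deblock_rows[0])` gives about 4.36, which is consistent with the figure of roughly 4.3% quoted for deblocking; the small remainder that would come from x264_clip3 is not included.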