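Original: http://hi.baidu.com/twavelet/blog/item/86a91a00a2c7810b738da5e9.html
A cache is a special kind of memory made up of a cache storage unit and a cache control unit. The storage unit generally uses semiconductor memory of the same type as the CPU, and its access speed is several times to more than ten times faster than main memory. The control unit includes a main-memory address register, a cache address register, a main-memory-to-cache address translation unit, and a replacement control unit. The ultimate purpose of a cache is to reduce the average memory access time.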
TMS320C6000 DSP Cache User's Guide: This document explains the fundamentals of memory caches and describes how to efficiently utilize the TMS320C6000 DSP two-level internal cache-based memory architecture. It shows how to maintain coherence with external memory, how to use DMA to reduce memory latencies, and how to optimize your code to improve cache efficiency.
TMS320C64x L1 L2 Cache Architecture
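http://volvet.blogbus.com/logs/8227255.html
C64x Cache Architecture
The C64x CPU uses a two-level cache architecture plus external memory. The Level 1 cache is split by function into an L1 Program Cache (L1P) and an L1 Data Cache (L1D), each 16 KB in size. The CPU can access the L1 caches without any delay because they run at the same clock frequency as the C64x CPU, 600 MHz. Level 2 memory is the C64x addressable on-chip memory, 1024 KB in size, and it can be configured as L2 cache or as ordinary memory. L2 cache mainly caches off-chip memory, while L2 ordinary memory (L2 SRAM) serves as fast-access memory. L2 memory runs at 300 MHz. An L1D miss that is served from L2 cache costs 8 cycles, while an L1D miss served from L2 SRAM costs 6 cycles. L2 cache acts as both program cache and data cache at the same time. If both L1 and L2 miss, the access goes to external memory, which is very slow because the C64x external memory clock is 100 MHz to 133 MHz. Keeping the CPU away from external memory as far as possible is therefore the key to improving program performance.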
Figure 1: Flat Versus Hierarchical Memory Architecture
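The L1 cache is split into separate program (L1P) and data (L1D) caches of 16 KB each, with an access speed matched to the DSP core clock. The L2 cache is managed as a unified cache ranging from 256 KB to 1 MB, and its access speed is considerably lower than that of L1. The L2 cache exchanges data with the slower external memory devices through DMA.
Address mapping and address translation: main-memory addresses are mapped to (translated into) cache addresses, and data is exchanged in units of blocks.
Direct-mapped cache: each main-memory block can map to only one specific cache block location, usually chosen by a modulo (mod) operation. Multiple locations may map to a single location in the cache. Also called a single-way set-associative cache.
N-way set-associative cache: direct mapping between sets, fully associative mapping within a set. Also called a multi-way set-associative cache.
Fully-associative cache: allows any memory address to be stored at any location within the cache.
Write-back: on a cache write hit, only the cache contents are modified and the line is marked with a dirty bit. The data is not written to main memory immediately, only when the line is replaced. On a cache write miss, L2 mostly uses write allocate: the line containing the written address is brought into the cache and updated there (that is, an L2 write miss allocates space in L2).
Write-through: a line can never be dirty, i.e. the cache never holds data that is newer than main memory.
Replacement algorithms: least recently used (LRU), first in first out (FIFO), random.
Cache principle, or the principle behind a hierarchical memory system: it relies on the locality of program accesses, i.e. the Principle of Locality.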
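The memory system raises data bandwidth with three techniques: a hierarchical, banked organization; separate storage for instructions and data; and separate instruction and data buses. A typical cache system has three levels:
1. The first-level cache (L1) sits on the CPU chip and runs at the CPU clock frequency;
2. The second-level cache (L2) is also on chip, slower than L1 but larger;
3. The third-level cache (L3) sits outside the CPU and is the slowest and largest memory.
The performance of each cache level depends on how far the cache is from the processor. The table below gives representative access times for each level.
Table 1: Cache access times for a processor with a multi-level cache system and a 2 ns clock cycle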
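The table above shows some basic parameters of the C64x DSP. Its line size is 32 bytes and L1P is 16 KB, so L1P holds 512 line frames of 32 bytes. On the TMS320C621x/C671x the line size is 64 bytes and L1P is 4 KB, so L1P holds 64 line frames of 64 bytes. An address in memory always maps to a fixed line frame, i.e. multiple lines in memory are mapped to the same set in the cache, and in a direct-mapped cache a set has only one line frame. The figure below illustrates this (it also shows how a cache miss and a cache hit are determined):
Figure 7: C64x L1P Architecture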
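Memory addresses 0000h to 001Fh always map to line frame 0, and 3FE0h to 3FFFh always map to line frame 511. At that point the cache capacity is used up, so addresses 4000h to 401Fh wrap around and map to line frame 0 again. Note that on the C64x one L1P line frame holds exactly one instruction fetch packet; on the C671x a line frame holds two fetch packets.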
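The mapping described above can be sketched as a small helper. This is only an illustration using the C64x L1P figures from the text (16 KB, 32-byte lines); the macro and function names are hypothetical and not part of any TI library.

```c
#include <stdint.h>

/* C64x L1P figures taken from the text: 16 KB capacity, 32-byte lines. */
#define L1P_SIZE_BYTES  (16u * 1024u)
#define L1P_LINE_BYTES  32u
#define L1P_NUM_FRAMES  (L1P_SIZE_BYTES / L1P_LINE_BYTES)   /* 512 line frames */

/* Hypothetical helper: the line frame a byte address maps to in a
   direct-mapped cache, i.e. the memory line number modulo the frame count. */
static unsigned l1p_line_frame(uint32_t addr)
{
    return (unsigned)((addr / L1P_LINE_BYTES) % L1P_NUM_FRAMES);
}
```

With these numbers, 0000h and 001Fh both give frame 0, 3FE0h gives frame 511, and 4000h wraps back to frame 0.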
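The valid bit (V) in the figure above is the validity state of a line frame: V=0 means the line frame is invalid, i.e. it holds no cached data, and V=1 means it is valid. Note that L1P has no LRU bit. The figures below show how the cache controller splits a program address:
Figure 8: C64x Memory Address from Cache Controller
Figure 9: C621x/C671x L1P Address Allocation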
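As Figure 8 shows, when the CPU reads an instruction at some address, the cache controller splits the address into three parts. Bits 0 to 4 are the byte offset within the 32-byte line. Bits 5 to 13 select the set the address maps to (for a direct-mapped cache, the set number equals the line frame number). The controller then checks the valid bit and compares the tag (bits 14 to 31; the tag is needed because multiple lines in memory are mapped to the same set in the cache). The exact operation is shown in Figure 1-6. If the final result is 0, the access is a read miss. A read miss also means that a line frame will be allocated for the line containing the requested address. Because L1P is read-allocate, the L1P controller fetches the fetch packet from L2 or from memory and places it in the corresponding L1P line frame, sets the tag, and sets V=1 to mark the set as holding valid data. The fetch packet is also forwarded to the CPU, which completes the access.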
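As a rough illustration of the lookup just described, the following sketch models a direct-mapped, read-allocate cache with the C64x L1P bit layout (offset bits 4..0, set bits 13..5, tag bits 31..14). It tracks only the valid bit and tag, not the cached data, and all type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 5u                  /* bits 4..0: byte within the 32-byte line */
#define SET_BITS    9u                  /* bits 13..5: one of 512 sets */
#define NUM_SETS    (1u << SET_BITS)

typedef struct {
    bool     valid;                     /* V bit of the line frame */
    uint32_t tag;                       /* address bits 31..14 */
} LineFrame;

typedef struct {
    LineFrame frame[NUM_SETS];          /* direct-mapped: one line frame per set */
} DirectMappedCache;

static uint32_t addr_set(uint32_t addr) { return (addr >> OFFSET_BITS) & (NUM_SETS - 1u); }
static uint32_t addr_tag(uint32_t addr) { return addr >> (OFFSET_BITS + SET_BITS); }

/* Returns true on a read hit. On a read miss the line frame is allocated
   (read-allocate): the tag is set and V becomes 1. */
static bool dm_read(DirectMappedCache *c, uint32_t addr)
{
    LineFrame *f = &c->frame[addr_set(addr)];
    if (f->valid && f->tag == addr_tag(addr))
        return true;
    f->tag = addr_tag(addr);
    f->valid = true;
    return false;
}
```

Starting from an all-zero cache, reading 0000h misses, reading 001Ch then hits in the same line, and reading 4000h misses and evicts the line at 0000h because both map to set 0.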
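Note: the most important thing when using a cache is to avoid replacing a line frame while its contents are still useful, i.e. to maximize cache line reuse.
One way to reduce cache misses is to build sets that contain more than one line frame, which is the principle behind the 2-way set-associative L1D cache. Several memory lines with the same set value can then reside in the cache at the same time without conflicting, which raises the hit rate.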
L1D Cache (2-Way Set-Associative Cache)
The L1D cache is a 2-way set-associative cache of 16 KB (4 KB on the C671x). Its line size is 64 bytes (32 bytes on the C621x/C671x). 2-way set-associative means that for any given set L1D has two line frames (ways) that can hold a line, so compared with a direct-mapped cache L1D sees far fewer conflict misses. As with L1P, an L1D miss stalls the instruction pipeline. Unlike L1P, L1D is writable, so write latency can come into play; this is discussed further below.
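To make the two-way idea concrete, here is a minimal sketch of a 2-way set-associative read path with one LRU indicator per set. It assumes the C64x L1D figures from the text (16 KB, 64-byte lines, hence 128 sets), models reads only, and uses hypothetical names throughout.

```c
#include <stdbool.h>
#include <stdint.h>

#define L1D_LINE_BYTES 64u
#define L1D_WAYS       2u
#define L1D_SETS       (16u * 1024u / L1D_LINE_BYTES / L1D_WAYS)   /* 128 sets */

typedef struct {
    bool     valid[L1D_WAYS];
    uint32_t tag[L1D_WAYS];
    unsigned lru;                 /* index of the least recently used way */
} Set2Way;

typedef struct {
    Set2Way set[L1D_SETS];
} Cache2Way;

/* Returns true on a read hit. On a miss the least recently used way of the
   set is replaced by the new line. */
static bool l1d_read(Cache2Way *c, uint32_t addr)
{
    uint32_t line = addr / L1D_LINE_BYTES;
    Set2Way *s = &c->set[line % L1D_SETS];
    uint32_t tag = line / L1D_SETS;

    for (unsigned w = 0; w < L1D_WAYS; ++w) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru = 1u - w;      /* the other way is now least recently used */
            return true;
        }
    }
    unsigned victim = s->lru;
    s->valid[victim] = true;
    s->tag[victim] = tag;
    s->lru = 1u - victim;
    return false;
}
```

Addresses 0000h and 2000h map to the same set here, yet both can stay resident; only a third line in that set (for example 4000h) forces the least recently used one out.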
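The C64x L1D is a 2-way set-associative cache; the table below lists its basic parameters:
Table 5: L1D Characteristics
Figure 12: C621x/C671x L2 Address Allocation (All L2 Cache Modes)
The following walks through how the L2 cache works, taking a CPU read of a cacheable external memory address as the example.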
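1. The access misses in L1 (L1P or L1D) and also misses in L2 cache. The corresponding line is brought from external memory into L2 cache, and the LRU bits decide which line frame receives it. If that line frame holds dirty data, the data is written back to its location in external memory when the new line replaces it. (If the line is also held in L1D, L1 must write it back to L2 before the L2 line is written back to external memory; this operation keeps the caches coherent.) The new line is then passed on to L1 in the form of an L1 line. The L1 cache stores it and hands the requested data to the CPU. Note that if the L1 line frame that will receive the line holds dirty data, that data must likewise be written back to L2 cache first.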
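2. The address hits in L2 cache. The corresponding line is brought straight into L1 and handed to the CPU.
As mentioned earlier, the L2 cache is read and write allocate. This means that when the CPU writes to external memory and the access misses in both L1 and L2, the line at that location is brought from external memory into an L2 line frame just as for a read, and as with a read, any dirty data in the replaced line frame is written back to external memory first. Note, however, that the line will not appear in L1D, because L1D is read-allocate only, not write-allocate. On an L2 write hit, the L2 line frame is updated directly with the CPU's write data.
L1P Cache: 4 KB, 64 bytes/line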
L1D Cache:
L2 Memory: consists of L2 SRAM (addressable on-chip memory) and L2 cache (caching external memory locations)
Types of cache misses
Capacity miss:
Remedies: (1) reduce the amount of data that is operated on at a time; (2) increase the capacity of the cache.
Conflict miss:
Remedies: (1) change the memory layout; (2) create sets that can hold two or more lines.
Compulsory miss: also called a first-reference miss. It occurs the first time data is accessed and cannot be avoided unless the system prefetches the data.
Cache Miss (read miss, write miss):
Cache Hit (read hit, write hit):
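For every kind of miss, the cache controller incurs a delay while it brings the data from memory into the cache. For better performance, the contents of each line should be reused as much as possible before the line is replaced. Reusing a line to access different locations within it exploits spatial locality, while reusing the same locations within a line exploits temporal locality. This is the most basic rule for optimizing cache performance: maximize cache line reuse.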
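Cache-based memory system model: Execution Time = ↓Cycle Count / ↑CPU Clock Rate. The arrows mark the direction that shortens execution time: fewer cycles (which is what the cache is for) and a higher CPU clock rate.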
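The model above can be tried out with a few lines of C. The first function is the formula as stated; the second is an assumed, simplified way of building the cycle count (CPU cycles plus a fixed stall per miss) that is not taken from the source. Both function names are hypothetical.

```c
/* Execution time in seconds = cycle count / CPU clock rate in Hz. */
static double execution_time_s(double cycle_count, double clock_hz)
{
    return cycle_count / clock_hz;
}

/* Assumed simplification: every miss adds the same number of stall cycles
   on top of the cycles the CPU needs with a perfect cache. */
static double total_cycles(double cpu_cycles, double misses, double stall_per_miss)
{
    return cpu_cycles + misses * stall_per_miss;
}
```

For example, 1000 CPU cycles plus 10 L1D misses served from L2 cache at 8 stall cycles each (the figure quoted earlier) gives 1080 cycles, and 600e6 cycles at 600 MHz take one second.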
Optimizing Cache Performance: get to know the cache memory architecture, in particular characteristics such as line size, associativity, capacity, replacement scheme, read/write allocation, miss pipelining, and write buffer.
Application-Level Optimizations: use L2 cache and DMA
Procedural-Level Optimizations: choose data types that reduce memory bandwidth; chaining; avoid L1P and L1D conflict misses; avoid L1D thrashing; avoid capacity misses; avoid write buffer stalls
L2 Access Conflict: L2 can only service one request at a time. The priority order is L1P read miss; L1D read or write miss; EDMA read or write; internal cache operations (victim writebacks, line fills, snoops). This includes conflicts caused by simultaneous accesses to the same bank.
L2 Bank Conflict: Since an L2 access requires 2 cycles to complete, accesses to the same bank on consecutive cycles cause a stall.
The optimization approach for L1P is fairly simple. The main principle is to implement algorithms with loops and iteration wherever possible, which keeps code size small.
Cache Flush performs two actions: it writes the contents of cache memory back to external memory and cleans the cache memory. Cache Clean performs only one action: it cleans the cache memory.
Execute packet: may contain between 1 and 8 instructions.
Fetch packet: A block of 8 instructions. One fetch packet may contain multiple execute packets.
Miss pipelining (Pipelined Misses): The process of servicing a single cache miss is pipelined over several cycles, which makes it possible to overlap the processing of several misses.
Associativity: The number of line frames in each set, or the number of ways.
Long-distance access: a CPU access to external noncacheable memory
Line: A cache line is the smallest block of data that the cache operates on.
Set: A collection of line frames in a cache. A direct-mapped cache contains one line frame per set, and an N-way set-associative cache contains N line frames per set. A fully-associative cache has only one set that contains all of the line frames in the cache.
Way: In a set-associative cache each set contains multiple line frames. The number of line frames in each set is referred to as the number of ways in the cache.
Victim Buffer:A special buffer that holds victims until they are written back.
Write Buffer:
Write merging: combines multiple independent writes into a single, larger write. It is used, for example, for DMA writes, and the L1D write buffer or victim buffer can merge multiple writes.
Memory System Coherence
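1. Cache Coherence Problem
Coherence between CPU and EDMA (peripheral) or host accesses: a memory system is coherent if any read of a data item returns the most recently written value of that data item. A coherent memory system ensures that all writes to a given memory location are visible to future reads.
For example: a peripheral writes and the CPU read hits, so memory is updated but the cache is not; or the CPU write hits and a peripheral reads, so the cache is updated but memory is not. In both cases the cache and memory are incoherent.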
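Both cases can be reproduced with a toy model of a single shared word that has one copy in memory and one in a write-back cache. This is a sketch for illustration only: the struct and functions are hypothetical, the "DMA" accesses simply bypass the cached copy, and nothing here models the real hardware protocol.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t memory;   /* value held in memory */
    uint32_t cached;   /* copy held in the cache */
    bool     valid;    /* cache copy present */
    bool     dirty;    /* cache copy newer than memory */
} SharedWord;

/* CPU read: served from the cache on a hit, allocated from memory on a miss. */
static uint32_t cpu_read(SharedWord *w)
{
    if (!w->valid) {
        w->cached = w->memory;
        w->valid = true;
        w->dirty = false;
    }
    return w->cached;
}

/* CPU write into a write-back cache: only the cached copy changes. */
static void cpu_write(SharedWord *w, uint32_t v)
{
    w->cached = v;
    w->valid = true;
    w->dirty = true;
}

/* Peripheral (DMA) accesses see memory only. */
static void     dma_write(SharedWord *w, uint32_t v) { w->memory = v; }
static uint32_t dma_read(const SharedWord *w)        { return w->memory; }

/* Manual coherence operations. */
static void cache_writeback(SharedWord *w)
{
    if (w->valid && w->dirty) {
        w->memory = w->cached;
        w->dirty = false;
    }
}

static void cache_invalidate(SharedWord *w)
{
    w->valid = false;
    w->dirty = false;
}
```

After a `dma_write`, `cpu_read` keeps returning the stale cached value until `cache_invalidate` is called; after a `cpu_write`, `dma_read` keeps returning the old memory value until `cache_writeback` is called.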
Consequently, if a memory location is shared, cached, and has been modified, there is a cache coherence problem. The conditions under which it arises are:
Multiple devices (CPUs, peripherals, DMA controllers) share a region of memory for the purpose of data exchange;
This memory region is cacheable by at least one device;
A memory location in this region has been cached;
And this memory location is modified (by any device).
2. Snoop Commands: a lower-level memory checks whether the requested address is cached (valid) in a higher-level memory
L1D Snoop Command (C64x devices only):
Writes back a line from L1D to L2 SRAM/cache
Used for DMA reads of L2 SRAM
L1D Snoop-Invalidate Command:
Writes back a line from L1D to L2 SRAM/cache and invalidates it in L1D
Used for DMA writes to L2 SRAM and user-controlled cache operations
L1P Invalidate Command:
Invalidates a line in L1P
Used for DMA write of L2 SRAM and user-controlled cache operations
Note: DMA is not allowed to access addresses that map to L2 cache.
3. Cache Coherence Protocol: DMA Accesses to L2 SRAM
Figure 13: DMA Write to L2 SRAM
*) If the line is dirty, it is first written back to L2 SRAM and merged with the new data written by the DMA.
DMA write: the snoop commands are L1D writeback-invalidate and L1P invalidate. DMA read: writeback.
*) On C64x DSP a snoop command is sent; the line is written back and kept valid. On C621x/C671x DSP, a snoop-invalidate command is sent, which additionally invalidates the line in L1D.
Figure 14: DMA Read of L2 SRAM
4. Ways to resolve cache incoherence: 1) clean or flush cache memory; 2) double buffering, i.e. ping-pong buffering; 3) disable external memory caching
Table 7: Coherence Assurances in the Two-Level Memory System
Double buffering, i.e. ping-pong buffering, uses two sets of input and output buffers: one for the data the CPU is processing and one for EDMA transfers in progress. That makes four buffers, InBuffA and OutBuffA plus InBuffB and OutBuffB, with coherence between L1D and L2 SRAM maintained.
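The buffer rotation behind double buffering can be sketched on a host machine as follows. This is not the TI example program: `memcpy` stands in for the EDMA transfers (which on real hardware would run concurrently with the CPU), `process` is a placeholder for the actual algorithm, and `double_buffer_run` and the tiny `BUFSIZE` value are hypothetical. Only the four buffer names come from the text.

```c
#include <stddef.h>
#include <string.h>

#define BUFSIZE 4   /* illustrative block size in bytes */

static unsigned char InBuffA[BUFSIZE], InBuffB[BUFSIZE];
static unsigned char OutBuffA[BUFSIZE], OutBuffB[BUFSIZE];

/* Placeholder for the CPU's processing step. */
static void process(const unsigned char *in, unsigned char *out)
{
    for (size_t i = 0; i < BUFSIZE; ++i)
        out[i] = (unsigned char)(in[i] + 1u);
}

/* Processes nblocks blocks of BUFSIZE bytes from src into dst, alternating
   between the A and B buffer pairs on every block. */
static void double_buffer_run(const unsigned char *src, unsigned char *dst, size_t nblocks)
{
    unsigned char *in[2]  = { InBuffA,  InBuffB  };
    unsigned char *out[2] = { OutBuffA, OutBuffB };

    if (nblocks == 0)
        return;
    memcpy(in[0], src, BUFSIZE);                         /* prime the first input buffer */
    for (size_t b = 0; b < nblocks; ++b) {
        size_t cur = b & 1u, nxt = cur ^ 1u;
        if (b + 1 < nblocks)                             /* "EDMA" fills the other input buffer */
            memcpy(in[nxt], src + (b + 1) * BUFSIZE, BUFSIZE);
        process(in[cur], out[cur]);                      /* CPU works on the current pair */
        memcpy(dst + b * BUFSIZE, out[cur], BUFSIZE);    /* "EDMA" drains the output buffer */
    }
}
```

The point of the rotation is that the CPU never touches the pair of buffers that a transfer is currently using.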
Double-buffering example program: D:CCStudio_v3.3Cache ExamplesDSKC6711L2_double_buf
External-memory double-buffering example program: D:CCStudio_v3.3Cache ExamplesDSKC6711ext_double_buf
In addition to the coherence operations, it is important that all DMA buffers are aligned at an L2 cache line and are an integral multiple of cache lines large. The cache controller always operates on whole cache lines, so the block size should be an integer multiple of the cache line size and the block should start on a line boundary.
Figure 15: Double Buffering in L2 SRAM
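The alignment and size rule for DMA buffers can be checked with a few helpers like the ones below. The L2 line size of 128 bytes is an assumption that should be confirmed against the device documentation, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed L2 line size in bytes; confirm for the actual device. */
#define L2_LINE_BYTES 128u

/* Rounds a buffer size up to the next multiple of the L2 line size. */
static size_t round_up_to_line(size_t nbytes)
{
    return (nbytes + L2_LINE_BYTES - 1u) / L2_LINE_BYTES * L2_LINE_BYTES;
}

/* True if the address starts on an L2 line boundary. */
static bool is_line_aligned(const void *p)
{
    return ((uintptr_t)p % L2_LINE_BYTES) == 0u;
}

/* True if the buffer is line-aligned and a whole number of lines long. */
static bool buffer_is_cache_safe(const void *p, size_t nbytes)
{
    return is_line_aligned(p) && (nbytes % L2_LINE_BYTES) == 0u;
}
```

A buffer that fails this check shares a cache line with unrelated data, so a writeback or invalidate of the buffer would also affect that neighbouring data.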
Note: a DMA write to L2 SRAM triggers an L1D snoop-invalidate; a DMA read of L2 SRAM causes no snoops, because the data is not cached in L1D. C621x/C671x and C64x DSPs automatically maintain cache coherence for accesses by the CPU and EDMA to L2 SRAM through a hardware cache coherence protocol based on snoop commands, which keeps L2 SRAM and L1D coherent.
Whenever external memory caching is enabled and the EDMA is used to transfer to/from external memory, it is your responsibility to maintain cache coherence. That is, you must manually keep external memory and L2 cache coherent, and keep L2 cache and L1D coherent (when L2 cache is enabled).
Writeback-invalidate before a DMA write: CACHE_wbInvL2(InBuffB, BUFSIZE, CACHE_WAIT);
Writeback before a DMA read transfer: CACHE_wbL2(OutBuffB, BUFSIZE, CACHE_WAIT);
Table 8: DMA Scenarios With Coherence Operation Required
Memory Access Ordering
The C6000 DSP cores may initiate up to two parallel memory operations per cycle.
Table 9: Program Order for Memory Operations Issued From a Single Execute Packet