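I have recently been running prefetching experiments on gem5 and adding my own prefetch algorithm, in this case a hardware stream prefetcher. It took fixing several bugs before the experiment ran, so this post records the process.
Steps for adding your own prefetch algorithm to gem5:
(1) In gem5-master/configs/common/Caches.py, enable prefetching: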
class L1Cache(Cache):
    assoc = 2
    tag_latency = 2
    data_latency = 2
    response_latency = 2
    mshrs = 4
    tgts_per_mshr = 20
    prefetcher = StridePrefetcher(degree=8, latency=1.0)

class L2Cache(Cache):
    assoc = 8
    tag_latency = 20
    data_latency = 20
    response_latency = 20
    mshrs = 20
    tgts_per_mshr = 12
    write_buffers = 8
    prefetcher = StridePrefetcher(degree=8, latency=1.0)
(2) Under gem5-master/src/mem/cache/prefetch/, add your own prefetch algorithm, mainly the files stream.hh and stream.cc.
(3) Register it in Prefetcher.py under gem5-master/src/mem/cache/prefetch/:
class StreamPrefetcher(QueuedPrefetcher):
    type = 'StreamPrefetcher'
    cxx_class = 'StreamPrefetcher'
    cxx_header = "mem/cache/prefetch/stream.hh"
    table_sets = Param.Int(16, "Number of sets in PC lookup table")
    table_assoc = Param.Int(4, "Associativity of PC lookup table")
    tableSize = Param.Int(8, "Number of stream table entries per context")
    distance = Param.Int(5, "Stream window distance")
    use_master_id = Param.Bool(True, "Use master id based history")
    degree = Param.Int(4, "Number of prefetches to generate")
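Once the SimObject above is registered, the cache configuration from step (1) can presumably point at it instead of StridePrefetcher. The fragment below is only a sketch of that idea: it assumes a gem5 build that already contains the StreamPrefetcher from this post, it reuses the parameter names and defaults declared above, and it is not runnable outside gem5.

```python
# Sketch for configs/common/Caches.py. Assumption: gem5 was rebuilt with the
# StreamPrefetcher SimObject described in this post, so m5.objects exports it.
from m5.objects import Cache, StreamPrefetcher

class L1Cache(Cache):
    assoc = 2
    tag_latency = 2
    data_latency = 2
    response_latency = 2
    mshrs = 4
    tgts_per_mshr = 20
    # Parameter names and values are the ones declared in Prefetcher.py above.
    prefetcher = StreamPrefetcher(tableSize=8, distance=5, degree=4)
```

Passing the parameters explicitly makes it easy to sweep tableSize, distance and degree from one place during experiments.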
(4) Register the source file in the SConscript under gem5-master/src/mem/cache/prefetch/:
Import('*')
SimObject('Prefetcher.py')
Source('base.cc')
Source('queued.cc')
Source('stride.cc')
Source('tagged.cc')
Source('stream.cc')  # add this line
Note: without this entry, the build fails at link time:
build/X86/python/_m5/param_StreamPrefetcher_wrap.o: In function `_wrap_StreamPrefetcherParams_create':
/home/jyf/download/gem_nvmain/gem5-master/build/X86/python/_m5/param_StreamPrefetcher_wrap.cc:4549: undefined reference to `StreamPrefetcherParams::create()'
collect2: error: ld returned 1 exit status
scons: *** [build/X86/gem5.opt] Error 1
scons: building terminated because of errors.
The cause is that, without the entry, stream.cc is never compiled into stream.o, so the linker cannot resolve the symbol.
During the build, stream.o, StreamPrefetcher.hh (under gem5-master/build/ARM/params/) and param_StreamPrefetcher_wrap.cc (under build/ARM/python/_m5/) are generated. All of these files are tied to StreamPrefetcher * create().
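Since a missing Source() line only shows up late, at link time, a quick check before building can save a long compile. The helper below is a hypothetical standalone script, not part of gem5 or SCons: it simply lists the .cc files in a directory that the SConscript never names, and demonstrates this on a throwaway directory that mimics the layout from this step.

```python
import os
import re
import tempfile

# Matches uncommented lines such as: Source('stream.cc')
SOURCE_RE = re.compile(r"""^\s*Source\(\s*['"]([^'"]+)['"]""", re.MULTILINE)

def unlisted_sources(directory, sconscript_name="SConscript"):
    """Return the .cc files in `directory` that the SConscript never names."""
    with open(os.path.join(directory, sconscript_name)) as f:
        listed = set(SOURCE_RE.findall(f.read()))
    present = {n for n in os.listdir(directory) if n.endswith(".cc")}
    return sorted(present - listed)

# Demo: stream.cc exists on disk but has no Source() entry yet.
with tempfile.TemporaryDirectory() as d:
    for name in ("base.cc", "queued.cc", "stride.cc", "stream.cc"):
        open(os.path.join(d, name), "w").close()
    with open(os.path.join(d, "SConscript"), "w") as f:
        f.write("Import('*')\nSimObject('Prefetcher.py')\n"
                "Source('base.cc')\nSource('queued.cc')\nSource('stride.cc')\n")
    missing = unlisted_sources(d)

print(missing)  # ['stream.cc']
```

In a real checkout the function would be pointed at src/mem/cache/prefetch/ instead of the temporary directory.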
(5) The hardware stream prefetcher code I found is fairly old and does not match this gem5 version, so stream.cc and stream.hh also need changes:
In stream.cc:
Change Addr to AddrPriority:
StreamPrefetcher::calculatePrefetch(const PacketPtr &pkt,
        std::vector<AddrPriority> &addresses) {
    uint32_t core_id = pkt->req->hasContextId() ? pkt->req->contextId() : -1;
    // Test the flag rather than the ID so that context 0 is not skipped.
    if (!pkt->req->hasContextId()) {
        DPRINTF(HWPrefetch, "ignoring request with no core ID");
        return;
    }
    .......
    for (uint8_t d = 1; d <= degree; d++) {
        Addr pf_addr = table[i]->endAddr + blkSize * d;
        AddrPriority addrp;
        addrp.first = pf_addr;
        addresses.push_back(addrp);
        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
    }
    ......
    for (uint8_t d = 1; d <= degree; d++) {
        Addr pf_addr = table[i]->endAddr - blkSize * d;
        AddrPriority addrp;
        addrp.first = pf_addr;
        addresses.push_back(addrp);
        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
    }
}
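In stream.hh:
Change Addr to AddrPriority:
void calculatePrefetch(const PacketPtr &pkt, std::vector<AddrPriority> &addresses);
Without this change, the compiler reports the error below. It says that the subclass does not implement a virtual function of its parent class, but the real cause is the version mismatch: with the old parameter type, the function no longer overrides the pure virtual function declared in the base class.
stream.cc:182:34: error: invalid new-expression of abstract class type 'StreamPrefetcher',
virtual void calculatePrefetch(const PacketPtr &pkt, std::vector<AddrPriority> &addresses);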
(6) Set cpu-type = Timing
File: gem5-master/configs/common/CpuConfig.py
Source analysis:
CPU types (the default is detailed):
_cpu_aliases_all = [
    ("timing", "TimingSimpleCPU"),
    ("atomic", "AtomicSimpleCPU"),
    ("minor", "MinorCPU"),
    ("detailed", "DerivO3CPU"),
    ("kvm", ("ArmKvmCPU", "ArmV8KvmCPU", "X86KvmCPU")),
    ("trace", "TraceCPU"),
]
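To make the alias table concrete, here is a small standalone sketch of how such a list can be resolved to a class name. The function resolve_alias and the set of available models are hypothetical and only illustrate the data structure; this is not how gem5 itself performs the lookup.

```python
# Alias table copied from the listing above.
_cpu_aliases_all = [
    ("timing", "TimingSimpleCPU"),
    ("atomic", "AtomicSimpleCPU"),
    ("minor", "MinorCPU"),
    ("detailed", "DerivO3CPU"),
    ("kvm", ("ArmKvmCPU", "ArmV8KvmCPU", "X86KvmCPU")),
    ("trace", "TraceCPU"),
]

def resolve_alias(alias, available):
    """Hypothetical helper: map an alias to the first class name in `available`."""
    for name, target in _cpu_aliases_all:
        if name != alias:
            continue
        # An alias may name one class or a tuple of per-architecture candidates.
        candidates = target if isinstance(target, tuple) else (target,)
        for cls_name in candidates:
            if cls_name in available:
                return cls_name
    return None

# Assumed set of CPU models compiled into the binary, for illustration only.
built = {"TimingSimpleCPU", "AtomicSimpleCPU", "DerivO3CPU", "X86KvmCPU"}
print(resolve_alias("timing", built))  # TimingSimpleCPU
print(resolve_alias("kvm", built))     # X86KvmCPU
```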
Change the CPU type to timing, i.e. m5.objects.TimingSimpleCPU:
def config_etrace(cpu_cls, cpu_list, options):
    if issubclass(cpu_cls, m5.objects.TimingSimpleCPU):
        for cpu in cpu_list:
            cpu.traceListener = m5.objects.ElasticTrace(
                                instFetchTraceFile = options.inst_trace_file,
                                dataDepTraceFile = options.data_trace_file,
                                depWindowSize = 3 * cpu.numROBEntries)
            cpu.numROBEntries = 512;
            cpu.LQEntries = 128;
            cpu.SQEntries = 128;
    else:
        fatal("%s does not support data dependency tracing. Use a CPU model of"
              " type or inherited from TimingSimpleCPU.", cpu_cls)
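(7) Rebuild:
sudo scons EXTRAS=../nvmain ./build/ARM/gem5.opt
This may still fail with:
No module name specified using %module or -module.
scons: *** [build/ARM/python/_m5/param_VirtIO9PBase_wrap.cc] Error 1
This one was baffling. In the end I deleted everything from the previous build (the ARM directory under build/):
rm -rf ARM
Then I rebuilt, and this time the build succeeded.
Note: if you change any source code between builds, it is best to delete the build output and rebuild from scratch; otherwise you may get some very puzzling errors.
The source of stream.cc and stream.hh follows: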
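stream.cc
#include "mem/cache/prefetch/stream.hh"

#include "debug/HWPrefetch.hh"

StreamPrefetcher::StreamPrefetcher(const StreamPrefetcherParams *p)
    : QueuedPrefetcher(p),
      tableSize(p->tableSize),
      useMasterId(p->use_master_id),
      degree(p->degree),
      distance(p->distance) {
    // One stream table per context.
    for (int i = 0; i < MaxContexts; i++) {
        StreamTable[i] = new StreamTableEntry*[tableSize];
        for (int j = 0; j < tableSize; j++) {
            StreamTable[i][j] = new StreamTableEntry[tableSize];
            StreamTable[i][j]->LRU_index = j;
            resetEntry(StreamTable[i][j]);
        }
    }
}
StreamPrefetcher::~StreamPrefetcher() {
    for (int i = 0; i < MaxContexts; i++) {
        for (int j = 0; j < tableSize; j++) {
            delete[] StreamTable[i][j];
        }
        // Also release the per-context array of entry pointers.
        delete[] StreamTable[i];
    }
}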
// Training and Prefetching of streams
void
StreamPrefetcher::calculatePrefetch(const PacketPtr &pkt,
        std::vector<AddrPriority> &addresses) {
    uint32_t core_id = pkt->req->hasContextId() ? pkt->req->contextId() : -1;
    //uint32_t core_id = pkt->req->contextId();
    //if (core_id < 0) {
    // Test the flag rather than the ID so that context 0 is not skipped.
    if (!pkt->req->hasContextId()) {
        DPRINTF(HWPrefetch, "ignoring request with no core ID");
        return;
    }
    Addr blk_addr = pkt->getAddr() & ~(Addr)(blkSize-1); // cache block aligned address.
    assert(core_id < MaxContexts);
    StreamTableEntry** table;
    table = StreamTable[core_id]; // Per core stream training.
    uint32_t i;
    // Check if there is a stream entry with the same address as blk_addr
    for (i = 0; i < tableSize; i++) {
        switch (table[i]->status) {
          case MONITOR:
            if (table[i]->trainedDirection == ASCENDING) {
                // Ascending order
                if ((table[i]->startAddr < blk_addr) && (table[i]->endAddr > blk_addr)) {
                    // Hit to a stream, which is monitored. Issue prefetch requests based on the degree and the direction
                    for (uint8_t d = 1; d <= degree; d++) {
                        Addr pf_addr = table[i]->endAddr + blkSize * d;
                        addresses.push_back(AddrPriority(pf_addr, 0));
                        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
                    }
                    if ((table[i]->endAddr + blkSize * degree) - table[i]->startAddr <= distance) {
                        table[i]->endAddr = table[i]->endAddr + blkSize * degree;
                    } else {
                        table[i]->startAddr = table[i]->startAddr + blkSize * degree;
                        table[i]->endAddr = table[i]->endAddr + blkSize * degree;
                    }
                    break;
                }
            } else if (table[i]->trainedDirection == DESCENDING) {
                // Descending order
                if ((table[i]->startAddr > blk_addr) && (table[i]->endAddr < blk_addr)) {
                    for (uint8_t d = 1; d <= degree; d++) {
                        Addr pf_addr = table[i]->endAddr - blkSize * d;
                        addresses.push_back(AddrPriority(pf_addr, 0));
                        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
                    }
                    if (table[i]->startAddr - (table[i]->endAddr - blkSize * degree) <= distance) {
                        table[i]->endAddr = table[i]->endAddr - blkSize * degree;
                    } else {
                        table[i]->startAddr = table[i]->startAddr - blkSize * degree;
                        table[i]->endAddr = table[i]->endAddr - blkSize * degree;
                    }
                    break;
                }
            } else {
                assert(0);
            }
            break;
          case TRAINING:
            if ((abs(table[i]->allocAddr - blk_addr) <= (distance/2) * blkSize)) {
                // Check whether the address is in +/- of distance
                if (table[i]->trendDirection[0] == INVALID) {
                    table[i]->trendDirection[0] = (blk_addr - table[i]->allocAddr > 0) ? ASCENDING : DESCENDING;
                } else {
                    assert(table[i]->trendDirection[1] == INVALID);
                    table[i]->trendDirection[1] = (blk_addr - table[i]->allocAddr > 0) ? ASCENDING : DESCENDING;
                    if (table[i]->trendDirection[0] == table[i]->trendDirection[1]) {
                        table[i]->trainedDirection = table[i]->trendDirection[0];
                        table[i]->startAddr = table[i]->allocAddr;
                        if (table[i]->trainedDirection != INVALID) {
                            // Based on the trainedDirection (+1:Ascending, -1:Descending) update the end address of a stream
                            table[i]->endAddr = blk_addr + (table[i]->trainedDirection) * blkSize * degree;
                        }
                        // Entry is ready for issuing prefetch requests
                        table[i]->status = MONITOR;
                    } else {
                        resetEntry(table[i]);
                    }
                }
                break;
            }
            break;
          default:
            break;
        } // End of Switch
    } // End of for loop
    // Note: the break statements above leave only the switch, not the for
    // loop, so i always equals tableSize at this point.
    uint32_t HIT_index = i;
    int INVALID_index = tableSize;
    for (int i = 0; i < tableSize; i++) { // find empty entry
        if (table[i]->status == INV) {
            INVALID_index = i;
            break;
        }
    }
    int TEMP_index = -1;
    int LRU_index = -1000000;
    for (int i = 0; i < tableSize; i++) { // find the LRU entry
        if (table[i]->LRU_index > TEMP_index) {
            TEMP_index = table[i]->LRU_index;
            LRU_index = i;
        }
    }
    assert(TEMP_index == tableSize - 1);
    int entry_id;
    if (HIT_index != tableSize) { // hit
        entry_id = HIT_index;
    } else if (INVALID_index != tableSize) {
        // Existence of invalid streams
        assert(table[INVALID_index]->status == INV);
        table[INVALID_index]->status = TRAINING;
        table[INVALID_index]->allocAddr = blk_addr;
        entry_id = INVALID_index;
    } else {
        // Replace the LRU stream-entry
        assert(table[LRU_index]->status != INV);
        resetEntry(table[LRU_index]);
        table[LRU_index]->status = TRAINING;
        table[LRU_index]->allocAddr = blk_addr;
        entry_id = LRU_index;
    }
    // Shifting the table entries after the eviction of lru-id
    for (int i = 0; i < tableSize; i++) {
        if (table[i]->LRU_index < table[entry_id]->LRU_index) {
            table[i]->LRU_index = table[i]->LRU_index + 1;
        }
    }
    table[entry_id]->LRU_index = 0;
}
void
StreamPrefetcher::resetEntry(StreamTableEntry *this_entry)
{
    this_entry->status = INV;
    this_entry->trendDirection[0] = INVALID;
    this_entry->trendDirection[1] = INVALID;
    this_entry->allocAddr = 0;
    this_entry->startAddr = 0;
    this_entry->endAddr = 0;
    this_entry->trainedDirection = INVALID;
}
StreamPrefetcher*
StreamPrefetcherParams::create()
{
    return new StreamPrefetcher(this);
}
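The LRU bookkeeping at the end of calculatePrefetch is easy to misread, so here is a minimal Python sketch of just that step. It assumes, as in the constructor, that each entry starts with a distinct rank (LRU_index = j) and that rank 0 means most recently used; the function names touch and victim are made up for this illustration.

```python
def touch(lru, entry_id):
    """Make `entry_id` most recently used; lru[k] is entry k's rank (0 = newest)."""
    old_rank = lru[entry_id]
    for k in range(len(lru)):
        if lru[k] < old_rank:  # only entries that were newer age by one step
            lru[k] += 1
    lru[entry_id] = 0

def victim(lru):
    """Index of the entry with the largest rank, i.e. the least recently used."""
    return max(range(len(lru)), key=lambda k: lru[k])

lru = [0, 1, 2, 3]  # initial ranks, mirroring LRU_index = j
touch(lru, 2)       # ranks become [1, 2, 0, 3]
touch(lru, 3)       # ranks become [2, 3, 1, 0]
print(lru, victim(lru))  # [2, 3, 1, 0] 1
```

Because only the entries that were newer than the touched one are shifted, the ranks always remain a permutation of 0..tableSize-1, which is what the assert on TEMP_index relies on.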
stream.hh
#ifndef __MEM_CACHE_PREFETCH_STREAM_HH__
#define __MEM_CACHE_PREFETCH_STREAM_HH__
#include "mem/cache/prefetch/queued.hh"
#include "params/StreamPrefetcher.hh"
enum StreamDirection {
    ASCENDING = 1,
    DESCENDING = -1,
    INVALID = 0
};
enum StreamStatus {
    INV = 0,
    TRAINING = 1,
    MONITOR = 2
};
class StreamPrefetcher : public QueuedPrefetcher {
  protected:
    static const uint32_t MaxContexts = 64;
    uint32_t tableSize;
    const bool useMasterId;
    uint32_t degree;
    uint32_t distance;
    class StreamTableEntry {
      public:
        int LRU_index;
        Addr allocAddr;
        Addr startAddr;
        Addr endAddr;
        StreamDirection trainedDirection;
        StreamStatus status;
        StreamDirection trendDirection[2];
    };
    void resetEntry(StreamTableEntry *this_entry);
    StreamTableEntry **StreamTable[MaxContexts];
  public:
    StreamPrefetcher(const StreamPrefetcherParams *p);
    ~StreamPrefetcher();
    void calculatePrefetch(const PacketPtr &pkt, std::vector<AddrPriority> &addresses);
};
#endif
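To see the training and monitoring states in action without building gem5, the following Python sketch models a single stream table entry. It is a simplification under stated assumptions, not a port of stream.cc: the 64-byte block size is assumed, there is no per-context table or LRU replacement, and addresses are plain signed Python integers, so the direction test reflects the intended behaviour rather than C++ unsigned arithmetic.

```python
BLK = 64        # assumed cache block size in bytes
DEGREE = 4      # default from Prefetcher.py
DISTANCE = 5    # default from Prefetcher.py
INV, TRAINING, MONITOR = 0, 1, 2

class Entry:
    def __init__(self):
        self.reset()

    def reset(self):
        self.status = INV
        self.alloc = self.start = self.end = 0
        self.trend = []     # up to two observed directions (+1 or -1)
        self.direction = 0  # trained direction once both trends agree

def access(e, addr):
    """Feed one demand address to a single entry; return prefetch addresses."""
    blk = addr & ~(BLK - 1)  # block-align the address
    if e.status == INV:      # allocate the entry and start training
        e.reset()
        e.status, e.alloc = TRAINING, blk
        return []
    if e.status == TRAINING:
        if abs(blk - e.alloc) > (DISTANCE // 2) * BLK:
            return []        # too far from the allocation point
        e.trend.append(1 if blk > e.alloc else -1)
        if len(e.trend) == 2:
            if e.trend[0] == e.trend[1]:  # two accesses agree: monitor
                e.direction = e.trend[0]
                e.start = e.alloc
                e.end = blk + e.direction * BLK * DEGREE
                e.status = MONITOR
            else:
                e.reset()    # directions disagree: drop the stream
        return []
    # MONITOR: a hit strictly inside the window issues DEGREE prefetches.
    lo, hi = sorted((e.start, e.end))
    if not (lo < blk < hi):
        return []
    step = e.direction * BLK
    prefetches = [e.end + step * d for d in range(1, DEGREE + 1)]
    # As in stream.cc, a byte span is compared with `distance` directly.
    if abs(e.end + step * DEGREE - e.start) > DISTANCE:
        e.start += step * DEGREE  # slide the window start as well
    e.end += step * DEGREE
    return prefetches

e = Entry()
issued = [access(e, a) for a in (0x1000, 0x1040, 0x1080, 0x10c0)]
print([hex(a) for a in issued[-1]])  # ['0x11c0', '0x1200', '0x1240', '0x1280']
```

The first access allocates the entry, the next two train the direction, and only the fourth access, which falls inside the monitored window, produces prefetch addresses.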
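Reference:
gem5 prefetching experiments
When adding your own prefetcher, the implementations of the prefetchers that ship with gem5, such as stride.cc and stride.hh, are a useful reference.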