gem5 prefetcher


Published 2019-07-13 19:14



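I have recently been running prefetching experiments on gem5 and adding my own prefetch algorithm, in this case a hardware stream prefetcher. I had to fix several bugs before the experiment ran, so this post records the process.
Steps for adding your own prefetch algorithm to gem5:
(1) In gem5-master/configs/common/Caches.py, enable prefetching: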

class L1Cache(Cache):
    assoc = 2
    tag_latency = 2
    data_latency = 2
    response_latency = 2
    mshrs = 4
    tgts_per_mshr = 20
    prefetcher = StridePrefetcher(degree=8, latency=1.0)  # add this line
    # prefetch_policy='tagged'  # on older gem5 versions, add this line

class L2Cache(Cache):
    assoc = 8
    tag_latency = 20
    data_latency = 20
    response_latency = 20
    mshrs = 20
    tgts_per_mshr = 12
    write_buffers = 8
    prefetcher = StridePrefetcher(degree=8, latency=1.0)  # add this line
    # prefetch_policy='tagged'  # on older gem5 versions, add this line
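
To make the `degree` parameter concrete, here is a minimal Python sketch (not gem5 code) of what a stride prefetcher conceptually does once it has observed a constant stride: it queues `degree` addresses ahead of the last access. The function name `stride_candidates` is hypothetical, and gem5's real StridePrefetcher also tracks per-PC confidence, which is left out here.

```python
def stride_candidates(last_addr, stride, degree=8):
    """Return up to `degree` candidate prefetch addresses ahead of last_addr.

    Hypothetical helper for illustration only; confidence tracking and
    cache-block alignment are deliberately omitted.
    """
    candidates = []
    for d in range(1, degree + 1):
        addr = last_addr + d * stride
        if addr < 0:
            break  # a negative stride ran off the bottom of the address space
        candidates.append(addr)
    return candidates


if __name__ == "__main__":
    # One 64-byte block per access, degree 8: the next eight blocks are queued.
    print([hex(a) for a in stride_candidates(0x1000, 64, degree=8)])
```

A larger degree fetches further ahead per trigger, at the cost of more cache and memory traffic when the prediction is wrong.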

(2) Under gem5-master/src/mem/cache/prefetch/, add your own prefetch algorithm. This mainly means the files stream.hh and stream.cc.
(3) Register it in Prefetcher.py under gem5-master/src/mem/cache/prefetch/:

class StreamPrefetcher(QueuedPrefetcher):
    type = 'StreamPrefetcher'
    cxx_class = 'StreamPrefetcher'
    cxx_header = "mem/cache/prefetch/stream.hh"
    table_sets = Param.Int(16, "Number of sets in PC lookup table")
    table_assoc = Param.Int(4, "Associativity of PC lookup table")
    tableSize = Param.Int(8, "Number of entries in each stream table")
    distance = Param.Int(5, "Prefetch distance of a stream")
    use_master_id = Param.Bool(True, "Use master id based history")
    degree = Param.Int(4, "Number of prefetches to generate")

(4) Register the source file in the SConscript under gem5-master/src/mem/cache/prefetch/:

Import('*')

SimObject('Prefetcher.py')

Source('base.cc')
Source('queued.cc')
Source('stride.cc')
Source('tagged.cc')
Source('stream.cc')  # add this line

Note: if this entry is missing, the build fails at link time with:

build/X86/python/_m5/param_StreamPrefetcher_wrap.o: In function `_wrap_StreamPrefetcherParams_create':
/home/jyf/download/gem_nvmain/gem5-master/build/X86/python/_m5/param_StreamPrefetcher_wrap.cc:4549: undefined reference to `StreamPrefetcherParams::create()'
collect2: error: ld returned 1 exit status
scons: *** [build/X86/gem5.opt] Error 1
scons: building terminated because of errors.

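The cause is that, without the entry above, stream.cc is never compiled into stream.o, so the link step cannot resolve the symbol. A correct build generates stream.o, StreamPrefetcher.hh (under gem5-master/build/ARM/params/), and param_StreamPrefetcher_wrap.cc (under build/ARM/python/_m5/). All of these files are tied to StreamPrefetcher *create().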
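
A quick way to catch this mistake before a long build is to compare the `.cc` files in the directory against the `Source(...)` entries in the SConscript. The sketch below is only an illustration: `missing_sources` is a hypothetical helper, it works on text and file names passed in by the caller, and it assumes each `Source('...')` call sits at the start of its own line, as in the listing above.

```python
import re


def missing_sources(sconscript_text, cc_files):
    """Return the .cc files that have no active Source('...') line.

    Lines that are commented out do not count, because the pattern only
    matches Source( at the start of a line (after optional whitespace).
    """
    pattern = r"^\s*Source\(\s*['\"]([^'\"]+)['\"]"
    listed = set(re.findall(pattern, sconscript_text, re.MULTILINE))
    return sorted(f for f in cc_files if f not in listed)


if __name__ == "__main__":
    sconscript = (
        "Import('*')\n"
        "SimObject('Prefetcher.py')\n"
        "Source('base.cc')\n"
        "Source('queued.cc')\n"
        "Source('stride.cc')\n"
        "Source('tagged.cc')\n"
    )
    files = ["base.cc", "queued.cc", "stride.cc", "tagged.cc", "stream.cc"]
    print("not registered:", missing_sources(sconscript, files))
```

In a real tree the text would come from reading the SConscript file and the file list from listing the directory; both are left to the caller here.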
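(5) The hardware stream prefetcher implementation I found is fairly old and does not match this gem5 version, so stream.cc and stream.hh also have to be modified. In stream.cc:
Change Addr to AddrPriority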

StreamPrefetcher::calculatePrefetch(const PacketPtr &pkt,
                                    std::vector<AddrPriority> &addresses) {
    uint32_t core_id = pkt->req->hasContextId() ? pkt->req->contextId() : -1;
    //uint32_t core_id = pkt->req->contextId();
    //if (core_id < 0) {
    // Test hasContextId(): !contextId() would also reject context 0.
    if (!pkt->req->hasContextId()) {
        DPRINTF(HWPrefetch, "ignoring request with no core ID");
        return;
    }
    .......


    for (uint8_t d = 1; d <= degree; d++) {
        Addr pf_addr = table[i]->endAddr + blkSize * d;
        AddrPriority addrp;           // (address, priority) pair
        addrp.first = pf_addr;
        addresses.push_back(addrp);
        //addresses.push_back(pf_addr);
        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
    }
    ......


    for (uint8_t d = 1; d <= degree; d++) {
        Addr pf_addr = table[i]->endAddr - blkSize * d;
        AddrPriority addrp;           // (address, priority) pair
        addrp.first = pf_addr;
        addresses.push_back(addrp);
        //addresses.push_back(pf_addr);
        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
    }


    }

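In stream.hh:
Change Addr to AddrPriority: void calculatePrefetch(const PacketPtr &pkt, std::vector<AddrPriority> &addresses); Without this change the build fails with the error below. It reads as if the subclass does not implement a virtual function of its parent class, but the real cause is the version mismatch in the function signature.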

stream.cc:182:34: error: invalid new-expression of abstract class type 'StreamPrefetcher'
   virtual void calculatePrefetch(const PacketPtr &pkt, std::vector<AddrPriority> &addresses);
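
The shape of the change is easier to see outside C++. In the sketch below (plain Python, not gem5 code) each queued prefetch is an (address, priority) pair rather than a bare address, mirroring the two loops shown in step (5). The name `queue_prefetches` and its parameters are hypothetical.

```python
def queue_prefetches(end_addr, blk_size, degree, ascending=True, priority=0):
    """Build the list of (address, priority) pairs for one stream hit.

    Mirrors the ascending and descending loops above: `degree` blocks
    beyond the current end of the stream, each tagged with a priority.
    """
    step = blk_size if ascending else -blk_size
    addresses = []
    for d in range(1, degree + 1):
        pf_addr = end_addr + step * d
        addresses.append((pf_addr, priority))  # a pair, not a bare address
    return addresses


if __name__ == "__main__":
    for addr, prio in queue_prefetches(0x1000, 64, degree=4):
        print(hex(addr), prio)
```

The older interface corresponds to appending `pf_addr` alone, which is exactly the line commented out in the C++ listing.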

(6) Set cpu-type to timing
gem5-master/configs/common/CpuConfig.py
Source walkthrough:
The available CPU types are listed below (the default is detailed):

_cpu_aliases_all = [
    ("timing", "TimingSimpleCPU"),
    ("atomic", "AtomicSimpleCPU"),
    ("minor", "MinorCPU"),
    ("detailed", "DerivO3CPU"),
    ("kvm", ("ArmKvmCPU", "ArmV8KvmCPU", "X86KvmCPU")),
    ("trace", "TraceCPU"),
    ]

Change it to timing, i.e. m5.objects.TimingSimpleCPU:

def config_etrace(cpu_cls, cpu_list, options):
    if issubclass(cpu_cls, m5.objects.TimingSimpleCPU):
        # Assign the same file name to all cpus for now. This must be
        # revisited when creating elastic traces for multi processor systems.
        for cpu in cpu_list:
            # Attach the elastic trace probe listener. Set the protobuf trace
            # file names. Set the dependency window size equal to the cpu it
            # is attached to.
            cpu.traceListener = m5.objects.ElasticTrace(
                                instFetchTraceFile = options.inst_trace_file,
                                dataDepTraceFile = options.data_trace_file,
                                depWindowSize = 3 * cpu.numROBEntries)
            # Make the number of entries in the ROB, LQ and SQ very
            # large so that there are no stalls due to resource
            # limitation as such stalls will get captured in the trace
            # as compute delay. For replay, ROB, LQ and SQ sizes are
            # modelled in the Trace CPU.
            cpu.numROBEntries = 512
            cpu.LQEntries = 128
            cpu.SQEntries = 128
    else:
        fatal("%s does not support data dependency tracing. Use a CPU model of"
              " type or inherited from TimingSimpleCPU.", cpu_cls)

(7) Rebuild: sudo scons EXTRAS=../nvmain ./build/ARM/gem5.opt. This may fail again with:

No module name specified using %module or -module.
scons: *** [build/ARM/python/_m5/param_VirtIO9PBase_wrap.cc] Error 1

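This error is baffling. In the end I deleted everything from the previous build (rm -rf ARM) and rebuilt from scratch, and this time the build succeeded.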
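
If the cleanup is done often, it can be scripted. The following is a minimal sketch using only the Python standard library; `clean_build_dir` is a hypothetical helper, and the build root and variant name (for example `build` and `ARM`) are supplied by the caller rather than assumed.

```python
import shutil
from pathlib import Path


def clean_build_dir(build_root, variant):
    """Delete build_root/variant if it exists.

    Returns True when a directory was removed and False when there was
    nothing to remove, so the caller can log what happened.
    """
    target = Path(build_root) / variant
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


if __name__ == "__main__":
    # Equivalent in spirit to running `rm -rf ARM` inside the build directory.
    removed = clean_build_dir("build", "ARM")
    print("removed stale build" if removed else "nothing to clean")
```

After the directory is gone, the scons command from step (7) rebuilds everything from scratch.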
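Note: if any source file changes between builds, it is best to delete the build output and rebuild from scratch; otherwise the build can fail with very confusing errors. The full stream.cc and stream.hh sources follow: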
stream.cc

#include "debug/HWPrefetch.hh"
#include "mem/cache/prefetch/stream.hh"

StreamPrefetcher::StreamPrefetcher(const StreamPrefetcherParams *p)
: QueuedPrefetcher(p),
  tableSize(p->tableSize),
  useMasterId(p->use_master_id),
  degree(p->degree),
  distance(p->distance) {
    for (int i = 0; i < MaxContexts; i++) {
        StreamTable[i] = new StreamTableEntry*[tableSize];
        for (int j = 0; j < tableSize; j++) {
            StreamTable[i][j] = new StreamTableEntry[tableSize];
            StreamTable[i][j]->LRU_index = j;
            resetEntry(StreamTable[i][j]);
        }
    }
}
StreamPrefetcher::~StreamPrefetcher() {
    for (int i = 0; i < MaxContexts; i++) {
        for (int j = 0; j < tableSize; j++) {
            delete[] StreamTable[i][j];
        }
        delete[] StreamTable[i];  // also free the per-core pointer array
    }
}

// Training and Prefetching of streams
void
StreamPrefetcher::calculatePrefetch(const PacketPtr &pkt,
                                    std::vector<AddrPriority> &addresses) {
    uint32_t core_id = pkt->req->hasContextId() ? pkt->req->contextId() : -1;
    //uint32_t core_id = pkt->req->contextId();
    //if (core_id < 0) {
    // Test hasContextId(): !contextId() would also reject context 0.
    if (!pkt->req->hasContextId()) {
        DPRINTF(HWPrefetch, "ignoring request with no core ID");
        return;
    }
    Addr blk_addr = pkt->getAddr() & ~(Addr)(blkSize-1); // cache block aligned address.
    assert(core_id < MaxContexts);
    StreamTableEntry** table;
    table = StreamTable[core_id];                          // Per core stream training.
    uint32_t i;
    // Check if there is a stream entry with the same address as blk_addr
    for (i = 0; i < tableSize; i++) {
        switch (table[i]->status) {
        case MONITOR:
            if(table[i]->trainedDirection == ASCENDING) {
                // Ascending order
                if((table[i]->startAddr < blk_addr ) && ( table[i]->endAddr > blk_addr)) {
                    // Hit to a stream, which is monitored. Issue prefetch requests based on the degree and the direction
                    for (uint8_t d = 1; d <= degree; d++) {
                        Addr pf_addr = table[i]->endAddr + blkSize * d;
                        addresses.push_back(AddrPriority(pf_addr,0));
                        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
                    }
                    if((table[i]->endAddr + blkSize * degree) - table[i]->startAddr <= distance) {
                        table[i]->endAddr   = table[i]->endAddr + blkSize * degree;
                    } else {
                        table[i]->startAddr = table[i]->startAddr + blkSize * degree;
                        table[i]->endAddr   = table[i]->endAddr   + blkSize * degree;
                    }
                    break;
                }
            } else if(table[i]->trainedDirection == DESCENDING) {
                // Descending order
                if((table[i]->startAddr > blk_addr ) && (table[i]->endAddr < blk_addr)) {
                    for (uint8_t d = 1; d <= degree; d++) {
                        Addr pf_addr = table[i]->endAddr - blkSize * d;
                        addresses.push_back(AddrPriority(pf_addr,0));
                        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
                    }
                    if(table[i]->startAddr - (table[i]->endAddr - blkSize * degree) <= distance){
                        table[i]->endAddr   = table[i]->endAddr - blkSize * degree;
                    } else {
                        table[i]->startAddr = table[i]->startAddr - blkSize * degree;
                        table[i]->endAddr   = table[i]->endAddr   - blkSize * degree;
                    }
                    break;
                }
            } else{
                assert(0);
            }
            break;
        case TRAINING:
            if ((abs(table[i]->allocAddr - blk_addr) <= (distance/2) * blkSize) ){
                // Check whether the address is in +/- of distance
                if(table[i]->trendDirection[0] == INVALID){
                    table[i]->trendDirection[0] = (blk_addr - table[i]->allocAddr > 0) ? ASCENDING : DESCENDING;
                } else {
                    assert(table[i]->trendDirection[1] == INVALID);
                    table[i]->trendDirection[1] = (blk_addr - table[i]->allocAddr > 0) ? ASCENDING : DESCENDING;
                    if(table[i]->trendDirection[0] == table[i]->trendDirection[1]) {
                        table[i]->trainedDirection = table[i]->trendDirection[0];
                        table[i]->startAddr = table[i]->allocAddr;
                        if(table[i]->trainedDirection != INVALID){
                            // Based on the trainedDirection (+1:Ascending, -1:Descending) update the end address of a stream
                            table[i]->endAddr = blk_addr + (table[i]->trainedDirection) * blkSize * degree;
                        }
                        // Entry is ready for issuing prefetch requests
                        table[i]->status = MONITOR;
                    } else {
                        resetEntry(table[i]);
                    }
                }
                break;
            }
            break;
        default:
            break;
        }  // End of Switch
    }  // End of for loop
    uint32_t HIT_index=i;
    int INVALID_index = tableSize;
    for (int i = 0; i < tableSize; i++) {  // find an empty (invalid) entry
        if(table[i]->status==INV) {
            INVALID_index = i;
            break;
        }
    }
    int TEMP_index = -1;
    int LRU_index = -1000000;
    for (int i = 0; i < tableSize; i++) {  // find the least recently used entry
        if(table[i]->LRU_index > TEMP_index) {
            TEMP_index = table[i]->LRU_index;
            LRU_index  = i;
        }
    }
    assert(TEMP_index == tableSize - 1);
    int entry_id;
    if(HIT_index!=tableSize) {  //hit
        entry_id = HIT_index;
    } else if (INVALID_index!=tableSize) {
        //Existence of invalid streams
        assert(table[INVALID_index]->status == INV);
        table[INVALID_index]->status = TRAINING;
        table[INVALID_index]->allocAddr = blk_addr;
        entry_id = INVALID_index;
    } else {
        //Replace the LRU stream-entry
        assert(table[LRU_index]->status!=INV);
        resetEntry(table[LRU_index]);
        table[LRU_index]->status = TRAINING;
        table[LRU_index]->allocAddr = blk_addr;
        entry_id = LRU_index;
    }
    // Shifting the table entries after the eviction of lru-id
    for (int i = 0; i < tableSize; i++) {
        if (table[i]->LRU_index < table[entry_id]->LRU_index) {
            table[i]->LRU_index = table[i]->LRU_index + 1;
        }
    }
    table[entry_id]->LRU_index = 0;

}

void
StreamPrefetcher::resetEntry(StreamTableEntry *this_entry)
{

    this_entry->status                = INV;
    this_entry->trendDirection[0]     = INVALID;
    this_entry->trendDirection[1]     = INVALID;
    this_entry->allocAddr             = 0;
    this_entry->startAddr             = 0;
    this_entry->endAddr               = 0;
    this_entry->trainedDirection      = INVALID;

}

StreamPrefetcher*
StreamPrefetcherParams::create()
{
    return new StreamPrefetcher(this);
}
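
The training logic in the TRAINING case is easier to follow as a small state machine. The Python sketch below models a single table entry under stated assumptions: it is not a port of the C++ above, the names `StreamEntry` and `train` are hypothetical, and the default values (64-byte blocks, degree 4, distance 5) are only examples. Unlike the C++, which subtracts unsigned `Addr` values, Python integers are signed, so the direction test distinguishes ascending from descending as intended; a repeated access to the allocating block is simply ignored here, which is a simplification.

```python
INV, TRAINING, MONITOR = 0, 1, 2
ASCENDING, DESCENDING, INVALID = 1, -1, 0


class StreamEntry:
    """One stream table entry (illustrative model only)."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.status = INV
        self.alloc_addr = 0
        self.start_addr = 0
        self.end_addr = 0
        self.trend = []          # up to two observed directions
        self.trained = INVALID


def train(entry, blk_addr, blk_size=64, degree=4, distance=5):
    """Feed one block-aligned access to the entry and return its status."""
    if entry.status == INV:
        # First access allocates the entry and starts training.
        entry.status = TRAINING
        entry.alloc_addr = blk_addr
        return entry.status
    if entry.status != TRAINING:
        return entry.status
    delta = blk_addr - entry.alloc_addr
    if delta == 0 or abs(delta) > (distance // 2) * blk_size:
        return entry.status      # outside the training window: ignore
    entry.trend.append(ASCENDING if delta > 0 else DESCENDING)
    if len(entry.trend) == 2:
        if entry.trend[0] == entry.trend[1]:
            # Two accesses in the same direction: the stream is trained.
            entry.trained = entry.trend[0]
            entry.start_addr = entry.alloc_addr
            entry.end_addr = blk_addr + entry.trained * blk_size * degree
            entry.status = MONITOR
        else:
            entry.reset()        # conflicting directions: start over
    return entry.status


if __name__ == "__main__":
    e = StreamEntry()
    for addr in (0x1000, 0x1040, 0x1080):
        print(hex(addr), train(e, addr))
```

Once an entry reaches MONITOR, accesses that fall between its start and end addresses trigger the prefetch loops shown earlier.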

stream.hh


#ifndef __MEM_CACHE_PREFETCH_STREAM_HH__
#define __MEM_CACHE_PREFETCH_STREAM_HH__

#include "mem/cache/prefetch/queued.hh"
#include "params/StreamPrefetcher.hh"
// Direction of stream for each stream entry in the stream table
enum StreamDirection{
        ASCENDING = 1,                      // For example - A, A+1, A+2
        DESCENDING = -1,                    // For example - A, A-1, A-2
        INVALID = 0
};
// Status of a stream entry in the stream table.
enum StreamStatus{
            INV       = 0,
            TRAINING  = 1,                  // Stream training is not over yet. Once trained will move to MONITOR status
            MONITOR   = 2                   // Monitor and Request: Stream entry ready for issuing prefetch requests
};

class StreamPrefetcher : public QueuedPrefetcher {
  protected:
    static const uint32_t MaxContexts = 64; // Creates per-core stream tables for up to 64 processor cores
    uint32_t tableSize;                     // Number of entries in a stream table
    const bool useMasterId;                 // Use the master-id to train the streams
    uint32_t degree;                        // Determines the number of prefetch requests to be issued at a time
    uint32_t distance;                      // Determines the prefetch distance

   /* StreamTableEntry 
     Stores the basic attributes of a stream table entry.
   */
  class StreamTableEntry {

      public:
        int  LRU_index;
        Addr allocAddr;                     // Address that initiated the stream training
        Addr startAddr;                     // First address of a stream
        Addr endAddr;                       // Last address of a stream
        StreamDirection trainedDirection;   // Direction of trained stream (Ascending or Descending)
        StreamStatus    status;             // Status of the stream entry
        StreamDirection trendDirection[2];  // Stores the last two stream directions of an entry

  };
  void resetEntry (StreamTableEntry *this_entry);

  /* Creating a StreamTable for each core with 
     Tablesize as the number of stream entries 
  */
  StreamTableEntry **StreamTable[MaxContexts];

  public:
  StreamPrefetcher(const StreamPrefetcherParams *p);
  ~StreamPrefetcher();
  /* Function called by cache controller to initiate 
     the stream training process
  */
  void calculatePrefetch(const PacketPtr &pkt, std::vector<AddrPriority> &addresses);
};

#endif // __MEM_CACHE_PREFETCH_STREAM_HH__
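
The `LRU_index` field stores a recency rank per entry, with 0 meaning most recently used and `tableSize - 1` meaning the replacement victim. The sketch below shows that bookkeeping in isolation; it is a plain-Python illustration, and the function names `touch` and `victim` are hypothetical.

```python
def touch(lru, entry_id):
    """Mark entry_id as most recently used.

    lru[i] is the recency rank of entry i (0 = most recent). Every entry
    that was more recent than entry_id ages by one; the list is updated
    in place and also returned.
    """
    old_rank = lru[entry_id]
    for i in range(len(lru)):
        if lru[i] < old_rank:
            lru[i] += 1
    lru[entry_id] = 0
    return lru


def victim(lru):
    """Return the index of the least recently used entry (largest rank)."""
    return max(range(len(lru)), key=lambda i: lru[i])


if __name__ == "__main__":
    ranks = [0, 1, 2, 3]        # same initial ranks as the constructor assigns
    touch(ranks, 2)
    print(ranks, "victim:", victim(ranks))
```

Because only entries with a smaller rank are incremented, the ranks always remain a permutation of 0 to `tableSize - 1`, which is what the `assert(TEMP_index == tableSize - 1)` in stream.cc relies on.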

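Reference:
gem5预取实验 (gem5 prefetching experiment)
When adding your own prefetcher, it helps to study the prefetcher implementations that ship with gem5, such as stride.cc and stride.hh.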
