Testing the file-operation performance of HBase and Hadoop

Published 2019-04-15 18:49

1: Single-threaded file write to HBase
        // Local directory holding the test images
        String parentPath = "F:/pic/2003-zhujiajian";
        File[] files = getAllFilePath(parentPath);

        HBaseConfiguration config = new HBaseConfiguration();
        HTable table = new HTable(config, new Text("offer"));
        long start = System.currentTimeMillis();
        for (File file : files) {
            if (file.isFile()) {
                // Write each file's bytes to the "image_big" column, using the file name as the row identifier
                byte[] data = getData(file);
                createRecore(table, file.getName(), "image_big", data);
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("time cost=" + (end - start));
 108037206 bytes, 303 files written from a local Windows machine to remote HBase: 23328 milliseconds in one run, 21001 milliseconds in another
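The helpers getAllFilePath and getData used above are not shown in the post. A minimal sketch of what they might look like, assuming getAllFilePath lists the direct children of a directory and getData reads a whole file into memory (both bodies are guesses, not the original code):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class LocalFileHelpers {
    // Hypothetical stand-in for getAllFilePath: lists the direct children of a
    // directory (no recursion). Returns an empty array if the path is not a
    // readable directory.
    static File[] getAllFilePath(String parentPath) {
        File[] children = new File(parentPath).listFiles();
        return children != null ? children : new File[0];
    }

    // Hypothetical stand-in for getData: reads the entire file into a byte array.
    static byte[] getData(File file) throws IOException {
        return Files.readAllBytes(file.toPath());
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        String dir = args.length > 0 ? args[0] : ".";
        long totalBytes = 0;
        int fileCount = 0;
        for (File f : getAllFilePath(dir)) {
            if (f.isFile()) {
                totalBytes += getData(f).length;
                fileCount++;
            }
        }
        System.out.println(fileCount + " files, " + totalBytes + " bytes");
    }
}
```

Reading each file fully into memory is fine for images of a few megabytes, but it also means the whole value is sent to HBase as a single cell.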
2: Single-threaded file write to Hadoop (HDFS)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy the whole local directory into HDFS in a single call
        Path src = new Path("F:/pic/2003-zhujiajian");
        Path dst = new Path("/user/zxf/image");
        long start = System.currentTimeMillis();
        fs.copyFromLocalFile(src, dst);
        long end = System.currentTimeMillis();
        System.out.println("time cost=" + (end - start));
 108037206 bytes, 303 files written from a local Windows machine to remote HDFS: 26531 milliseconds in one run, 32407 milliseconds in another
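To make the two write results easier to compare, the timings can be converted to throughput. The sketch below only does arithmetic on the numbers reported above; the class and method names are made up for illustration.

```java
class WriteThroughput {
    // Converts a byte count and a duration in milliseconds to MiB per second.
    static double mibPerSecond(long bytes, long millis) {
        return (bytes / (1024.0 * 1024.0)) / (millis / 1000.0);
    }

    public static void main(String[] args) {
        long bytes = 108037206L;           // total size of the 303 files
        long[] hbaseMs = {23328, 21001};   // HBase write timings from section 1
        long[] hdfsMs = {26531, 32407};    // HDFS write timings from section 2
        for (long ms : hbaseMs) {
            System.out.printf("HBase write: %.2f MiB/s%n", mibPerSecond(bytes, ms));
        }
        for (long ms : hdfsMs) {
            System.out.printf("HDFS write:  %.2f MiB/s%n", mibPerSecond(bytes, ms));
        }
    }
}
```

With these figures the HBase writes come out at roughly 4.4 to 4.9 MiB/s and the HDFS writes at roughly 3.2 to 3.9 MiB/s, so for writes the two are in the same range.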
 
3: Single-threaded file read from HBase

  The time taken was unbelievably slow:
  108037206 bytes, 303 files read from HBase to local: 479350 milliseconds
 
4: Single-threaded file read from Hadoop (HDFS)
 108037206 bytes, 303 files read from HDFS to local: 14188 milliseconds

5: In-depth test
 Comparing a few individual files:
 fileSize (bytes)  HDFS time (ms)  HBase time (ms)
 12341140          1313            14688
 708474            63              4359
 82535             15              3907
 55296             16              125

6: Thoughts
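  During the test a "region offline" error occurred, and it persisted after restarting the services. I then reformatted the namenode, deleted the data on the datanodes, and restarted, only to find that one datanode still had not come up. After SSHing in, I found that its Java process had died.
  This cost me more than an hour. Thinking it over: an HTable is split into sub-tables spread across the HRegionServers, so when one datanode goes down, requests for its data cannot connect, hence the region offline error.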
 
 Why is HBase's read performance so poor? Reading a single file of about 11 MB takes roughly 14000 milliseconds, whereas reading the entire directory from HDFS takes only 14188 milliseconds.
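The size of the gap can be read off the per-file table in section 5 and the whole-directory timings in sections 3 and 4. A small sketch that computes the HBase-to-HDFS slowdown factor from those numbers (the class and method names are illustrative only):

```java
class ReadSlowdown {
    // How many times slower the HBase read was than the HDFS read.
    static double slowdown(long hbaseMs, long hdfsMs) {
        return (double) hbaseMs / hdfsMs;
    }

    public static void main(String[] args) {
        // Per-file measurements from the table in section 5.
        long[] sizes   = {12341140, 708474, 82535, 55296};
        long[] hdfsMs  = {1313, 63, 15, 16};
        long[] hbaseMs = {14688, 4359, 3907, 125};
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%d bytes: HBase is %.1fx slower%n",
                    sizes[i], slowdown(hbaseMs[i], hdfsMs[i]));
        }
        // Whole-directory reads: 479350 ms (HBase) vs 14188 ms (HDFS).
        System.out.printf("all 303 files: HBase is %.1fx slower%n",
                slowdown(479350, 14188));
    }
}
```

The per-file factors range from roughly 8x to 260x and do not grow steadily with file size, which suggests a large fixed cost per read in addition to whatever depends on size.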
  The article at http://blog.rapleaf.com/dev/?p=26 says:
  Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that matter), is store large amounts of binary data. When I say large amounts, I mean tens to hundreds of megabytes. Certainly both RDBMSs and HBase have the capabilities to store large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. The default size of individual files in a region is 256MB. The closer to the region limit you make each row, the more overhead you are paying to host those rows. If you have to store a lot of big files, then you’re best off storing in the local filesystem, or if you have LOTS of data, HDFS. You can still keep the metadata in an RDBMS or HBase - but do us all a favor and just keep the path in the metadata.
  So it seems HBase is not well suited to storing binary files; for an application that stores images, HDFS is the better fit.

  I then ran:
  alter table offer change image_big IN_MEMORY;
  a: I retested several times, including after restarting HBase and HDFS; HBase read speed was still about the same as before
 
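  b: After deleting the existing data and writing it again, the read test showed that small-file reads had become much faster
  fileSize (bytes)  run 1 (ms)  run 2 (ms)  run 3 (ms)
  12341140          11750       11109       11718
  708474            625         610         672
  82535             78          78          78
  55296             47          62          47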
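  So the read cache gives a sizable performance gain. When the amount of data is not very large, the bottleneck is read speed; reads of data under 100 KB are reasonably efficient, and the time taken is roughly proportional to the length of the data read.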
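The proportionality claim can be checked against the table in b by averaging the three columns for each file and dividing by the file size. This is a sketch over the numbers already reported; it assumes the three columns are repeated runs, and the names are made up for illustration.

```java
class ReadCostPerMiB {
    // Arithmetic mean of the timings for one file.
    static double averageMs(long... runs) {
        long sum = 0;
        for (long r : runs) {
            sum += r;
        }
        return (double) sum / runs.length;
    }

    // Milliseconds spent per MiB read.
    static double msPerMiB(long bytes, double ms) {
        return ms / (bytes / (1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        long[] sizes = {12341140, 708474, 82535, 55296};
        long[][] runs = {
            {11750, 11109, 11718},
            {625, 610, 672},
            {78, 78, 78},
            {47, 62, 47},
        };
        for (int i = 0; i < sizes.length; i++) {
            double avg = averageMs(runs[i]);
            System.out.printf("%d bytes: avg %.1f ms, %.0f ms per MiB%n",
                    sizes[i], avg, msPerMiB(sizes[i], avg));
        }
    }
}
```

All four files come out at roughly 940 to 990 ms per MiB, which is consistent with read time scaling linearly with size once the data is cached.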
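  But why did the earlier read performance not change? Can data read from disk not be cached?
  I then tried reading a few files from the first batch that was stored, and they were still slow. After repeating the steps in b, reads got faster.
  The cause may be that the system sets a cache flag when a row's column data is created, and that the cache is bound to the column; when HBase starts, it reads the column data carrying the cache flag into memory.
  So none of the data created before I ran alter table offer change image_big IN_MEMORY carries the cache flag. This cache differs from other caches, which load nothing at startup and cache data only after it has been accessed. As a result, the more data is cached, the slower startup becomes, and I have noticed this here too. It is good for the user experience, though, since the first access is not especially slow.

  c: So why is reading from HBase slower than reading from HDFS, and so much slower for large files? Too many network round trips?
  Judging from the debug log, this does seem to be the case: the larger the file, the more times the regionServer responds to the client. The details still need a careful read of the source code.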