ElasticSearch初探

DSP

ElasticSearch初探

2019-07-13 16:41发布生成海报

站内文章 / DSP

17374 0

最近在做数据漏斗的需求，就是统计ADX调用到我们DSP系统，我们DSP系统内部，对流量的过滤情况，做一个可视化界面，供PM查看。因为存储采用的ES，之前没有在项目中使用过，这两天看官方文档恶补了下，说实话，英文的官方文档真的是最好的学习资料，写语句时遇到了问题，去论坛发帖，一个外国友人回复了，因为我在提问时没有把代码格式化，以至于可读性非常差，后来改了，可读性增强，这样对别人也是一种尊重，别人浏览到这个帖子后，也能很清晰地看懂。很喜欢这种感觉，以后以官方文档为资料，不过英语真的是硬伤，需要多学习。外国友人给的建议需求：统计参与竞价的PV数，条件是开始时间和结束时间，adxid，adlist不能为空。指标是对sidcount求和。其中对我来说难点在adlist不能为空这个条件。下面我们看一下索引的mapping

{
  "lm_dsp_data_search_funnel-2018.08.10": {
    "mappings": {
      "doc": {
        "properties": {
          "@timestamp": {
            "type": "date"
          },
          "@version": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "abtest": {
            "type": "keyword"
          },
          "adlist": {
            "type": "nested",
            "properties": {
              "promotionid": {
                "type": "integer"
              },
              "sn": {
                "type": "integer"
              }
            }
          },
          "adxid": {
            "type": "keyword"
          },
          "age": {
            "type": "integer"
          },
          "deviceidmd5": {
            "type": "keyword"
          },
          "displocal1": {
            "type": "integer"
          },
          "displocal2": {
            "type": "integer"
          },
          "escount": {
            "type": "long"
          },
          "gender": {
            "type": "integer"
          },
          "interest": {
            "type": "integer"
          },
          "mediaid": {
            "type": "integer"
          },
          "os": {
            "type": "integer"
          },
          "sid": {
            "type": "keyword"
          },
          "sidcount": {
            "type": "integer"
          },
          "slotid": {
            "type": "integer"
          },
          "stime": {
            "type": "long"
          },
          "styleids": {
            "type": "integer"
          },
          "tags": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "userip": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

可以发现 adlist 采用了 nested 的方式查询全部时，返回数据的一个结构，如下图所示：

最开始想使用es的script实现，写了下面的语句:

GET lm_dsp_data_search_funnel-2018.08.10/_search
{
  "size": 10, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "adxid": "u-2cbc34ryk3v43nkdq9g"
          }
        },
        {
          "range": {
            "stime": {
              "gte": 1533859200000,
              "lte": 1533873600000
            }
          }
        }
      ],
      "filter": {
        "script": {
        "script": "doc['adlist'].values.length > 0"
          }
      }
    }
  }, 
  "aggs": {
    "sn_count": {
      "sum": {
        "field": "sidcount"
      }
    }
  }
}

上面的语句报下面的错误：

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
          "doc['adlist'].values.length > 0",
          "    ^---- HERE"
        ],
        "script": "doc['adlist'].values.length > 0",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "lm_dsp_data_search_funnel-2018.08.10",
        "node": "eMpHZINKRxeM1YsKNDcjng",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
            "doc['adlist'].values.length > 0",
            "    ^---- HERE"
          ],
          "script": "doc['adlist'].values.length > 0",
          "lang": "painless",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "No field found for [adlist] in mapping with types []"
          }
        }
      }
    ]
  },
  "status": 500
}

从网上查资料，也没有解决，写脚本这个技能后面需要再学习一下。后来转换了想法，用别的方式实现。方法一：

GET lm_dsp_data_search_funnel-2018.08.10/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "adxid": {
              "value": "u-2cbc34ryk3v43nkdq9g"
            }
          }
        },
        {
          "range": {
            "stime": {
              "gte": 1533859200000,
              "lte": 1533898800000
            }
          }
        },
        {
          "nested" : {
            "path" : "adlist",
            "score_mode" : "none",
            "query" : {
              "exists":{"field":"adlist"}
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "count": {
      "sum": {
        "field": "sidcount"
      }
    }
  }
}

上面的会过滤掉{"adlist":[]}这样的数据，当拿不准的时候，看官方文档：

方式二：

GET lm_dsp_data_search_funnel-2018.08.10/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "adxid": {
              "value": "u-2cbc34ryk3v43nkdq9g"
            }
          }
        },
        {
          "range": {
            "stime": {
              "from": 1533891139904,
              "to": null,
              "include_lower": false,
              "include_upper": false,
              "boost": 1
            }
          }
        },
        {
          "nested" : {
            "path" : "adlist",
            "score_mode" : "none",
            "query" : {
                "bool" : {
                    "must" : [
                      { "range" : {"adlist.promotionid" : {"gte" : "0"}} }
                    ]
                }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "count": {
      "sum": {
        "field": "sidcount"
      }
    }
  }
}

上面的两种方法都能实现需求：

{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 397105,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "count": {
      "value": 397105
    }
  }
}

这个需求的时间花在了梳理需求，日志埋点，把日志收集到es中，统计数据，还有就是熟悉es的查询语句，今天是周五，下周一是开始的最后一天，需要快速把语句融入到代码中，这只是其中一个20分之一的需求，任务重大，不过当梳理清楚需求后，知道怎么做，下面的工作就会轻松些，很喜欢这种在工作中学习的感觉。

ElasticSearch初探

Ta的文章更多 >>

热门文章

ElasticSearch初探

Ta的文章 更多 >>

热门文章

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

Ta的文章更多 >>