DSP

ElasticSearch初探

2019-07-13 16:41发布

      最近在做数据漏斗的需求,就是统计ADX调用到我们DSP系统,我们DSP系统内部,对流量的过滤情况,做一个可视化界面,供PM查看。因为存储采用的ES,之前没有在项目中使用过,这两天看官方文档恶补了下,说实话,英文的官方文档真的是最好的学习资料,写语句时遇到了问题,去论坛发帖,一个外国友人回复了,因为我在提问时没有把代码格式化,以至于可读性非常差,后来改了,可读性增强,这样对别人也是一种尊重,别人浏览到这个帖子后,也能很清晰地看懂。很喜欢这种感觉,以后以官方文档为资料,不过英语真的是硬伤,需要多学习。外国友人给的建议 需求:统计参与竞价的PV数,条件是开始时间和结束时间,adxid,adlist不能为空。指标是对sidcount求和。其中对我来说难点在adlist不能为空这个条件。 下面我们看一下索引的mapping { "lm_dsp_data_search_funnel-2018.08.10": { "mappings": { "doc": { "properties": { "@timestamp": { "type": "date" }, "@version": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "abtest": { "type": "keyword" }, "adlist": { "type": "nested", "properties": { "promotionid": { "type": "integer" }, "sn": { "type": "integer" } } }, "adxid": { "type": "keyword" }, "age": { "type": "integer" }, "deviceidmd5": { "type": "keyword" }, "displocal1": { "type": "integer" }, "displocal2": { "type": "integer" }, "escount": { "type": "long" }, "gender": { "type": "integer" }, "interest": { "type": "integer" }, "mediaid": { "type": "integer" }, "os": { "type": "integer" }, "sid": { "type": "keyword" }, "sidcount": { "type": "integer" }, "slotid": { "type": "integer" }, "stime": { "type": "long" }, "styleids": { "type": "integer" }, "tags": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "userip": { "type": "keyword" } } } } } } 可以发现 adlist 采用了 nested 的方式 查询全部时,返回数据的一个结构,如下图所示:               最开始想使用es的script实现,写了下面的语句: GET lm_dsp_data_search_funnel-2018.08.10/_search { "size": 10, "query": { "bool": { "must": [ { "match": { "adxid": "u-2cbc34ryk3v43nkdq9g" } }, { "range": { "stime": { "gte": 1533859200000, "lte": 1533873600000 } } } ], "filter": { "script": { "script": "doc['adlist'].values.length > 0" } } } }, "aggs": { "sn_count": { "sum": { "field": "sidcount" } } } } 上面的语句报下面的错误: { "error": { "root_cause": [ { "type": "script_exception", "reason": "runtime error", "script_stack": [ "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)", "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)", "doc['adlist'].values.length > 0", " ^---- HERE" ], "script": "doc['adlist'].values.length > 0", "lang": "painless" } ], "type": "search_phase_execution_exception", "reason": "all shards failed", "phase": "query", "grouped": true, "failed_shards": [ { "shard": 0, "index": "lm_dsp_data_search_funnel-2018.08.10", "node": "eMpHZINKRxeM1YsKNDcjng", "reason": { "type": "script_exception", "reason": "runtime error", "script_stack": [ "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)", "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)", "doc['adlist'].values.length > 0", " ^---- HERE" ], "script": "doc['adlist'].values.length > 0", "lang": "painless", "caused_by": { "type": "illegal_argument_exception", "reason": "No field found for [adlist] in mapping with types []" } } } ] }, "status": 500 } 从网上查资料,也没有解决,写脚本这个技能后面需要再学习一下。后来转换了想法,用别的方式实现。 方法一: GET lm_dsp_data_search_funnel-2018.08.10/_search { "size": 0, "query": { "bool": { "must": [ { "term": { "adxid": { "value": "u-2cbc34ryk3v43nkdq9g" } } }, { "range": { "stime": { "gte": 1533859200000, "lte": 1533898800000 } } }, { "nested" : { "path" : "adlist", "score_mode" : "none", "query" : { "exists":{"field":"adlist"} } } } ] } }, "aggs": { "count": { "sum": { "field": "sidcount" } } } } 上面的会过滤掉{"adlist":[]}这样的数据,当拿不准的时候,看官方文档:     方式二: GET lm_dsp_data_search_funnel-2018.08.10/_search { "size": 0, "query": { "bool": { "must": [ { "term": { "adxid": { "value": "u-2cbc34ryk3v43nkdq9g" } } }, { "range": { "stime": { "from": 1533891139904, "to": null, "include_lower": false, "include_upper": false, "boost": 1 } } }, { "nested" : { "path" : "adlist", "score_mode" : "none", "query" : { "bool" : { "must" : [ { "range" : {"adlist.promotionid" : {"gte" : "0"}} } ] } } } } ] } }, "aggs": { "count": { "sum": { "field": "sidcount" } } } } 上面的两种方法都能实现需求: { "took": 42, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 397105, "max_score": 0, "hits": [] }, "aggregations": { "count": { "value": 397105 } } } 这个需求的时间花在了梳理需求,日志埋点,把日志收集到es中,统计数据,还有就是熟悉es的查询语句,今天是周五,下周一是开始的最后一天,需要快速把语句融入到代码中,这只是其中一个20分之一的需求,任务重大,不过当梳理清楚需求后,知道怎么做,下面的工作就会轻松些,很喜欢这种在工作中学习的感觉。