最近在做数据漏斗的需求,就是统计ADX调用到我们DSP系统,我们DSP系统内部,对流量的过滤情况,做一个可视化界面,供PM查看。因为存储采用的ES,之前没有在项目中使用过,这两天看官方文档恶补了下,说实话,英文的官方文档真的是最好的学习资料,写语句时遇到了问题,去论坛发帖,一个外国友人回复了,因为我在提问时没有把代码格式化,以至于可读性非常差,后来改了,可读性增强,这样对别人也是一种尊重,别人浏览到这个帖子后,也能很清晰地看懂。很喜欢这种感觉,以后以官方文档为资料,不过英语真的是硬伤,需要多学习。
外国友人给的建议
需求:统计参与竞价的PV数,条件是开始时间和结束时间,adxid,adlist不能为空。指标是对sidcount求和。其中对我来说难点在adlist不能为空这个条件。
下面我们看一下索引的mapping
{
"lm_dsp_data_search_funnel-2018.08.10": {
"mappings": {
"doc": {
"properties": {
"@timestamp": {
"type": "date"
},
"@version": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"abtest": {
"type": "keyword"
},
"adlist": {
"type": "nested",
"properties": {
"promotionid": {
"type": "integer"
},
"sn": {
"type": "integer"
}
}
},
"adxid": {
"type": "keyword"
},
"age": {
"type": "integer"
},
"deviceidmd5": {
"type": "keyword"
},
"displocal1": {
"type": "integer"
},
"displocal2": {
"type": "integer"
},
"escount": {
"type": "long"
},
"gender": {
"type": "integer"
},
"interest": {
"type": "integer"
},
"mediaid": {
"type": "integer"
},
"os": {
"type": "integer"
},
"sid": {
"type": "keyword"
},
"sidcount": {
"type": "integer"
},
"slotid": {
"type": "integer"
},
"stime": {
"type": "long"
},
"styleids": {
"type": "integer"
},
"tags": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"userip": {
"type": "keyword"
}
}
}
}
}
}
可以发现 adlist 采用了 nested 的方式
查询全部时,返回数据的一个结构,如下图所示:
最开始想使用es的script实现,写了下面的语句:
GET lm_dsp_data_search_funnel-2018.08.10/_search
{
"size": 10,
"query": {
"bool": {
"must": [
{
"match": {
"adxid": "u-2cbc34ryk3v43nkdq9g"
}
},
{
"range": {
"stime": {
"gte": 1533859200000,
"lte": 1533873600000
}
}
}
],
"filter": {
"script": {
"script": "doc['adlist'].values.length > 0"
}
}
}
},
"aggs": {
"sn_count": {
"sum": {
"field": "sidcount"
}
}
}
}
上面的语句报下面的错误:
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
"doc['adlist'].values.length > 0",
" ^---- HERE"
],
"script": "doc['adlist'].values.length > 0",
"lang": "painless"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "lm_dsp_data_search_funnel-2018.08.10",
"node": "eMpHZINKRxeM1YsKNDcjng",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
"doc['adlist'].values.length > 0",
" ^---- HERE"
],
"script": "doc['adlist'].values.length > 0",
"lang": "painless",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "No field found for [adlist] in mapping with types []"
}
}
}
]
},
"status": 500
}
从网上查资料,也没有解决,写脚本这个技能后面需要再学习一下。后来转换了想法,用别的方式实现。
方法一:
GET lm_dsp_data_search_funnel-2018.08.10/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"adxid": {
"value": "u-2cbc34ryk3v43nkdq9g"
}
}
},
{
"range": {
"stime": {
"gte": 1533859200000,
"lte": 1533898800000
}
}
},
{
"nested" : {
"path" : "adlist",
"score_mode" : "none",
"query" : {
"exists":{"field":"adlist"}
}
}
}
]
}
},
"aggs": {
"count": {
"sum": {
"field": "sidcount"
}
}
}
}
上面的会过滤掉{"adlist":[]}这样的数据,当拿不准的时候,看官方文档:
方式二:
GET lm_dsp_data_search_funnel-2018.08.10/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"adxid": {
"value": "u-2cbc34ryk3v43nkdq9g"
}
}
},
{
"range": {
"stime": {
"from": 1533891139904,
"to": null,
"include_lower": false,
"include_upper": false,
"boost": 1
}
}
},
{
"nested" : {
"path" : "adlist",
"score_mode" : "none",
"query" : {
"bool" : {
"must" : [
{ "range" : {"adlist.promotionid" : {"gte" : "0"}} }
]
}
}
}
}
]
}
},
"aggs": {
"count": {
"sum": {
"field": "sidcount"
}
}
}
}
上面的两种方法都能实现需求:
{
"took": 42,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 397105,
"max_score": 0,
"hits": []
},
"aggregations": {
"count": {
"value": 397105
}
}
}
这个需求的时间花在了梳理需求,日志埋点,把日志收集到es中,统计数据,还有就是熟悉es的查询语句,今天是周五,下周一是开始的最后一天,需要快速把语句融入到代码中,这只是其中一个20分之一的需求,任务重大,不过当梳理清楚需求后,知道怎么做,下面的工作就会轻松些,很喜欢这种在工作中学习的感觉。