[BUG] Error when creating APM monitoring #1294

Open
m84641693 opened this issue Mar 1, 2024 · 25 comments

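@m84641693

bk-monitor version (The versions used):
3.8.2

What happened:
Creating an APM scenario fails with an error (Monitoring Platform > Observation Scenarios > APM > New Application > set the application name and the English application name > Submit).
image

What you expected to happen:
An APM application can be created.

How to reproduce it:
Follow the steps above; it reproduces 100% of the time.

Log & Screenshot (access logs, application logs, screenshots):
I am not sure yet which logs are needed and will add them on request.

Anything else we need to know: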

@m84641693
Author

  • Key excerpt of the error (text): 7c5ef6f7f651010a3491df6193aa3cf5 : Error requesting system 'apm_api', returned error code: 1306201, returned message: {'result': False, 'code': 1306201, 'data': None, 'message': 'Component request third-party system [MONITOR_V3] interface [create_apm_application] error: Third-party system interface response time exceeds 30 seconds, please try again later or contact component developer to handle this'}, request URL: /create_apm_application/

@liuwenping
Collaborator

This looks like an ES index-creation problem. Try removing the replicas and see whether that helps.

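@m84641693
Author

> This looks like an ES index-creation problem. Try removing the replicas and see whether that helps.

image
With the replica count set to 0, the error is the same.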

@m84641693
Author

image

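@liuwenping
Collaborator

> > This looks like an ES index-creation problem. Try removing the replicas and see whether that helps.
>
> image With the replica count set to 0, the error is the same.

Judging from the two screenshots (idle rate and index count), the selected ES cluster is creating indices frequently and is fairly busy.
First check the health status (healthz) of this ES cluster, or try switching to a different ES cluster.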

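@m84641693
Author

m84641693 commented Mar 1, 2024

> > > This looks like an ES index-creation problem. Try removing the replicas and see whether that helps.
> >
> > image With the replica count set to 0, the error is the same.
>
> Judging from the two screenshots (idle rate and index count), the selected ES cluster is creating indices frequently and is fairly busy. First check the health status (healthz) of this ES cluster, or try switching to a different ES cluster.

image
image

  1. Switching to the ES bundled with BlueKing produces the same error.
  2. The ES cluster currently in use appears to be healthy.
  3. Calling the create_apm_application interface on its own also seems to work.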
curl -X POST  -H 'X-Bkapi-Authorization: {"bk_app_code": "bk_monitorv3", "bk_app_secret": "a8c9f548-e919-4d16-a3ec-529d8e03a18a", "bk_token": "UpI-0iHFYU26jNLTDV_3Jv7gCoY7LPvfr9k0xJMGy2Q", "bk_username": "admin"}' "http://bkapi.xxxxxx.com:80/api/bk-esb/prod/v2/monitor_v3/apm/create_application/"

{"result": false, "message": "Resource[CreateApplicationSimple] request parameter format error: (app_name) This field is required.", "data": {}, "detail": "Resource[CreateApplicationSimple] request parameter format error: (app_name) This field is required.", "code": 500, "request_id": "388600e6eaaf4848a5d6fe92d600b4e6"}

@m84641693
Author

m84641693 commented Mar 1, 2024

Testing the create_apm_application interface with the API test tool built into apigw reports that the interface does not exist:
image

Failed to request the gateway resource. Error message: HTTPConnectionPool(host='bkapi.xxxxxx.com', port=80): Max retries exceeded with url: /api/bk-esb/prod/v2/monitor_v3/create_apm_application/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe6a7690390>: Failed to establish a new connection: [Errno -2] Name or service not known',)).
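
A note on the message above: `[Errno -2] Name or service not known` is the error the resolver returns when a host name cannot be looked up, so the test tool may have failed before it ever reached the gateway, which is a different failure from a missing interface. A minimal Python sketch for checking name resolution from the machine or container that runs the test tool follows. `can_resolve` is a hypothetical helper, and the host to check would be the real gateway host, not the redacted `bkapi.xxxxxx.com` shown in the thread.

```python
import socket


def can_resolve(host, port=80):
    """Return True if the local resolver can turn `host` into an address."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Raised for resolver failures such as "Name or service not known".
        return False
```

If `can_resolve("<gateway host>")` returns False where the test tool runs, the DNS or hosts entry for the gateway is the first thing to look at.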

@m84641693
Author

Where is the value of BKAPP_MONITOR_API_BASE_URL configured?
image

@m84641693
Author

> Testing the create_apm_application interface with the API test tool built into apigw reports that the interface does not exist: image
>
> Failed to request the gateway resource. Error message: HTTPConnectionPool(host='bkapi.xxxxxx.com', port=80): Max retries exceeded with url: /api/bk-esb/prod/v2/monitor_v3/create_apm_application/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe6a7690390>: Failed to establish a new connection: [Errno -2] Name or service not known',)).

My rough analysis:
Inside the system, the URL requested for create_apm_application is:
/api/bk-esb/prod/v2/monitor_v3/create_apm_application/
whereas the real URL is /api/bk-esb/prod/v2/monitor_v3/apm/create_application/
so the creation fails.

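@liuwenping
Collaborator

Take the trace ID at the very start of the error message shown on the page and look it up in the monitoring backend log files (kernel*.log) to see which step the process reached.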

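@m84641693
Author

> Take the trace ID at the very start of the error message shown on the page and look it up in the monitoring backend log files (kernel*.log) to see which step the process reached.

Hello! Is this the trace ID you mean?
image

I searched kernel.log for the ID in the red box in the screenshot and found nothing.

I then tried creating another application and searched the same way, and again nothing matched.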

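@liuwenping
Collaborator

Trace ID logging is probably not enabled. You can also grep for the English name entered on the page; once you find any matching log line, grep further using the process ID from that line.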

@liuwenping
Collaborator

Search every log file whose name starts with kernel, not just kernel.log.
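
The two-step search described in the last two comments (match the English application name first, then follow the process ID) can be sketched in Python as below. This is only an illustration: `find_lines` is a hypothetical helper, the log directory and the `*kernel*` file pattern are assumptions, and because the log line layout is not shown in the thread, the process ID is read off a hit by hand and passed to a second search.

```python
import glob
import os


def find_lines(log_dir, keyword, pattern="*kernel*"):
    """Yield (file name, line) for every line containing `keyword`.

    Scans all files in `log_dir` whose names match `pattern`, not only
    kernel.log.
    """
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if keyword in line:
                    yield os.path.basename(path), line.rstrip("\n")
```

A first pass such as `list(find_lines(log_dir, "my_app"))` locates any line mentioning the application; a second pass with the process ID taken from one of those lines then pulls in the surrounding context.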

@m84641693
Author

> Search every log file whose name starts with kernel, not just kernel.log.

image

The following two log files starting with kernel contain relevant entries:
13152_kerner_api.log
kernel_metadata.log

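@liuwenping
Collaborator

> > Search every log file whose name starts with kernel, not just kernel.log.
>
> image
>
> The following two log files starting with kernel contain relevant entries: 13152_kerner_api.log kernel_metadata.log

Using the ID "13152", grep that file further to get more context.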

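@m84641693
Author

> > > Search every log file whose name starts with kernel, not just kernel.log.
> >
> > image
> > The following two log files starting with kernel contain relevant entries: 13152_kerner_api.log kernel_metadata.log
>
> Using the ID "13152", grep that file further to get more context.

I have collected the logs and packaged them as an attachment.
Please take a look.
image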

@liuwenping
Collaborator

image
Judging from the screenshot, the process is in fact stuck at creating the ES table. Try creating an ES index manually with curl and see how it behaves.
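
As a rough Python equivalent of the manual curl test suggested above, the sketch below builds the request that creates an index with a given replica count. The base URL and index name are placeholders, `build_create_index_request` is a hypothetical helper, and authentication is left out; a secured cluster would also need credentials.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, replicas=0):
    """Build (but do not send) a PUT request that creates an ES index."""
    body = {"settings": {"number_of_replicas": replicas}}
    return urllib.request.Request(
        url="{}/{}".format(base_url.rstrip("/"), index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending the request with `urllib.request.urlopen(request, timeout=30)` and timing the call with `time.monotonic()` would show whether index creation alone comes anywhere near the 30-second limit.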

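@m84641693
Author

> image Judging from the screenshot, the process is in fact stuck at creating the ES table. Try creating an ES index manually with curl and see how it behaves.

Creating an index with curl went smoothly.
image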

@liuwenping
Collaborator

The request may not be reaching the API; the API workers may be too few to handle the current request volume. Try scaling out the monitoring API service.

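@m84641693
Author

> The request may not be reaching the API; the API workers may be too few to handle the current request volume. Try scaling out the monitoring API service.

Hello, my environment is BlueKing 6.2.1 (binary deployment), where scaling out is more cumbersome than with the containerized deployment. Should I simply follow the procedure at https://bk.tencent.com/docs/markdown/ZH/DeploymentGuides/6.2/MaintenanceManual/DailyMaintenance/scale_node.md to scale out monitor_v3?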

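@m84641693
Author

> The request may not be reaching the API; the API workers may be too few to handle the current request volume. Try scaling out the monitoring API service.

Today I tried the following: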

# docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
2b63598f6134        bk-monitor:3.8.2    "/bin/sh -c 'bash ./…"   4 months ago        Up 2 weeks                              bk-monitor

#  docker exec -it 2b63598f6134 bash
[root@RS-BKML-P1 monitor]# vi /data/bkce/etc/supervisor-bkmonitorv3-monitor.conf 

In the file above, change numprocs under [program:kernel_api] from 1 to 2.
image

#  docker exec -it 2b63598f6134 supervisorctl -c /data/bkce/etc/supervisor-bkmonitorv3-monitor.conf reload
Restarted supervisord

Exit the container and confirm that the change took effect

image

Try creating a new APM application

The problem persists.
image
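
The numprocs change above was made by hand in vi. For reference, the same edit can be sketched with Python's `configparser`. The sample section below is a hypothetical stand-in, not the contents of the real `supervisor-bkmonitorv3-monitor.conf`, and `set_numprocs` is an illustrative helper. One detail worth knowing: supervisor requires `process_name` to include `%(process_num)s` once `numprocs` is greater than 1, so the sketch adds a default when none is present.

```python
import configparser

# Hypothetical stand-in for the [program:kernel_api] section; the real
# file in the thread is /data/bkce/etc/supervisor-bkmonitorv3-monitor.conf.
SAMPLE_CONF = """
[program:kernel_api]
command = /bin/true
numprocs = 1
"""


def set_numprocs(conf_text, program, numprocs):
    """Return a parser in which `program` runs `numprocs` processes."""
    # interpolation=None keeps supervisor's own %(...)s expressions intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    section = parser["program:" + program]
    section["numprocs"] = str(numprocs)
    # supervisor requires process_name to contain %(process_num)s
    # whenever numprocs is greater than 1.
    if numprocs > 1 and "%(process_num)" not in section.get("process_name", ""):
        section["process_name"] = "%(program_name)s_%(process_num)02d"
    return parser
```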

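@rxwycdh
Collaborator

rxwycdh commented Mar 28, 2024

1️⃣ An APM application has to be created from the page. Calling only the apigw interface is not enough, because the creation flow is a web interface that in turn calls the apigw interface; if you call only the apigw interface, some of the data supplied by the web interface will be missing.

2️⃣ According to the logs you provided, ES and InfluxDB both appear to have been created successfully, but the log shows a start time of 10:04:27 and an end time of 10:04:58, so the process took about 30 seconds. There may therefore be no real fault, only slow metadata processing. Could you extend the timeout of the create_apm_application interface in apigw and try creating the application again?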
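
To make the timing argument concrete, the short Python sketch below computes the gap between the two timestamps quoted above and compares it with the 30-second limit named in the original error message. `elapsed_seconds` is an illustrative helper and assumes both times fall on the same day.

```python
from datetime import datetime

GATEWAY_TIMEOUT_SECONDS = 30  # limit quoted in the original error message


def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two same-day clock times such as '10:04:27'."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


elapsed = elapsed_seconds("10:04:27", "10:04:58")
print(elapsed, elapsed > GATEWAY_TIMEOUT_SECONDS)  # prints: 31.0 True
```

The run took 31 seconds, just over the limit, which fits the hypothesis that creation succeeds but the caller gives up first.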

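@m84641693
Author

> 1️⃣ An APM application has to be created from the page. Calling only the apigw interface is not enough, because the creation flow is a web interface that in turn calls the apigw interface; if you call only the apigw interface, some of the data supplied by the web interface will be missing.
>
> 2️⃣ According to the logs you provided, ES and InfluxDB both appear to have been created successfully, but the log shows a start time of 10:04:27 and an end time of 10:04:58, so the process took about 30 seconds. There may therefore be no real fault, only slow metadata processing. Could you extend the timeout of the create_apm_application interface in apigw and try creating the application again?

Step 1: set the timeout of the create_apm_application interface to the maximum (300 s)
image

Step 2: test creating an APM application again
The failure persists.
image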

@rxwycdh
Collaborator

rxwycdh commented Mar 28, 2024

Please open the bkmonitor-api shell and run the application-creation code below by hand to see whether it raises an error. If it does not, the problem is the timeout.

from apm.resources import CreateApplicationResource
from metadata.models import ClusterInfo

bk_biz_id = <your_biz_id>
app_name = "your_app_name"
description = ""
CreateApplicationResource()(
            **{
                "bk_biz_id": bk_biz_id,
                "app_name": app_name,
                "app_alias": app_alias,
                "description": description,
                "es_storage_config": {
                  "es_storage_cluster": ClusterInfo.objects.filter(cluster_name="es7_cluster").first().cluster_id
                },
            }
        )

@m84641693
Author

> Please open the bkmonitor-api shell and run the application-creation code below by hand to see whether it raises an error. If it does not, the problem is the timeout.
>
> from apm.resources import CreateApplicationResource
> from metadata.models import ClusterInfo
>
> bk_biz_id = <your_biz_id>
> app_name = "your_app_name"
> description = ""
> CreateApplicationResource()(
>             **{
>                 "bk_biz_id": bk_biz_id,
>                 "app_name": app_name,
>                 "app_alias": app_alias,
>                 "description": description,
>                 "es_storage_config": {
>                   "es_storage_cluster": ClusterInfo.objects.filter(cluster_name="es7_cluster").first().cluster_id
>                 },
>             }
>         )

Hello, the output is below. Please take a look!

# docker ps 
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
2b63598f6134        bk-monitor:3.8.2    "/bin/sh -c 'bash ./…"   4 months ago        Up 3 weeks                              bk-monitor
# docker exec -it 2b63598f6134 ./bin/api_manage.sh shell
Python 3.6.6 (default, Dec 13 2019, 19:33:38) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from apm.resources import CreateApplicationResource
   ...: from metadata.models import ClusterInfo
   ...: 
   ...: bk_biz_id = 2
   ...: app_name = "hj_test666"
   ...: description = ""
   ...: CreateApplicationResource()(
   ...:             **{
   ...:                 "bk_biz_id": bk_biz_id,
   ...:                 "app_name": app_name,
   ...:                 "app_alias": app_alias,
   ...:                 "description": description,
   ...:                 "es_storage_config": {
   ...:                   "es_storage_cluster": ClusterInfo.objects.filter(cluster_name="es7_cluster").first().cluster_id
   ...:                 },
   ...:             }
   ...:         )
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-4475d199c2ec> in <module>
      9                 "bk_biz_id": bk_biz_id,
     10                 "app_name": app_name,
---> 11                 "app_alias": app_alias,
     12                 "description": description,
     13                 "es_storage_config": {

NameError: name 'app_alias' is not defined

In [2]: from apm.resources import CreateApplicationResource
   ...: from metadata.models import ClusterInfo
   ...: 
   ...: bk_biz_id = 2
   ...: app_name = "hj_test666"
   ...: description = ""
   ...: CreateApplicationResource()(
   ...:             **{
   ...:                 "bk_biz_id": bk_biz_id,
   ...:                 "app_name": app_name,
   ...:                 "app_alias": app_name,
   ...:                 "description": description,
   ...:                 "es_storage_config": {
   ...:                   "es_storage_cluster": ClusterInfo.objects.filter(cluster_name="es7_cluster").first().cluster_id
   ...:                 },
   ...:             }
   ...:         )
---------------------------------------------------------------------------
DoesNotExist                              Traceback (most recent call last)
/data/bkce/bkmonitorv3/monitor/apm/models/datasource.py in create_data_id(self)
     94         try:
---> 95             data_id_info = resource.metadata.query_data_source({"data_name": self.data_name})
     96         except metadata_models.DataSource.DoesNotExist:

/data/bkce/bkmonitorv3/monitor/core/drf_resource/base.py in __call__(self, *args, **kwargs)
     97 
---> 98         return ResourceData.objects.request(tmp_resource, args, kwargs)
     99 

/data/bkce/bkmonitorv3/monitor/core/drf_resource/models.py in request(self, resource, args, kwargs)
     52         if not (getattr(settings, "ENABLE_RESOURCE_DATA_COLLECT", False) and resource.support_data_collect):
---> 53             return resource.request(*args, **kwargs)
     54 

/data/bkce/bkmonitorv3/monitor/core/drf_resource/base.py in request(self, request_data, **kwargs)
    221             validated_request_data = self.validate_request_data(request_data)
--> 222             response_data = self.perform_request(validated_request_data)
    223             validated_response_data = self.validate_response_data(response_data)

/data/bkce/bkmonitorv3/monitor/metadata/resources/resources.py in perform_request(self, request_data)
    321         elif request_data["data_name"] is not None:
--> 322             data_source = models.DataSource.objects.get(data_name=request_data["data_name"])
    323 

/cache/.bk/env/lib/python3.6/site-packages/django/db/models/manager.py in manager_method(self, *args, **kwargs)
     84             def manager_method(self, *args, **kwargs):
---> 85                 return getattr(self.get_queryset(), name)(*args, **kwargs)
     86             manager_method.__name__ = method.__name__

/cache/.bk/env/lib/python3.6/site-packages/django/db/models/query.py in get(self, *args, **kwargs)
    436                 "%s matching query does not exist." %
--> 437                 self.model._meta.object_name
    438             )

DoesNotExist: DataSource matching query does not exist.

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-2-cd845453eeb6> in <module>
     12                 "description": description,
     13                 "es_storage_config": {
---> 14                   "es_storage_cluster": ClusterInfo.objects.filter(cluster_name="es7_cluster").first().cluster_id
     15                 },
     16             }

/data/bkce/bkmonitorv3/monitor/core/drf_resource/base.py in __call__(self, *args, **kwargs)
     96         from core.drf_resource.models import ResourceData
     97 
---> 98         return ResourceData.objects.request(tmp_resource, args, kwargs)
     99 
    100     @property

/data/bkce/bkmonitorv3/monitor/core/drf_resource/models.py in request(self, resource, args, kwargs)
     51         # 如果没有配置则退出
     52         if not (getattr(settings, "ENABLE_RESOURCE_DATA_COLLECT", False) and resource.support_data_collect):
---> 53             return resource.request(*args, **kwargs)
     54 
     55         resource_name = "{}.{}".format(resource.__class__.__module__, resource.__class__.__name__)

/data/bkce/bkmonitorv3/monitor/core/drf_resource/base.py in request(self, request_data, **kwargs)
    220             request_data = request_data or kwargs
    221             validated_request_data = self.validate_request_data(request_data)
--> 222             response_data = self.perform_request(validated_request_data)
    223             validated_response_data = self.validate_response_data(response_data)
    224             return validated_response_data

/data/bkce/bkmonitorv3/monitor/apm/resources.py in perform_request(self, validated_request_data)
     93             app_alias=validated_request_data["app_alias"],
     94             description=validated_request_data["description"],
---> 95             es_storage_config=validated_request_data["es_storage_config"],
     96         )
     97 

/cache/.bk/env/lib/python3.6/contextlib.py in inner(*args, **kwds)
     50         def inner(*args, **kwds):
     51             with self._recreate_cm():
---> 52                 return func(*args, **kwds)
     53         return inner
     54 

/data/bkce/bkmonitorv3/monitor/apm/models/application.py in create_application(cls, bk_biz_id, app_name, app_alias, description, es_storage_config)
     81 
     82         # step2: 创建结果表
---> 83         datasource_info = cls.apply_datasource(bk_biz_id, app_name, es_storage_config)
     84 
     85         # step3: 创建虚拟指标

/data/bkce/bkmonitorv3/monitor/apm/models/application.py in apply_datasource(cls, bk_biz_id, app_name, es_storage_config)
     60         # 默认创建和更新trace数据源和指标数据源
     61         for datasource in [TraceDataSource, MetricDataSource]:
---> 62             datasource.apply_datasource(bk_biz_id=bk_biz_id, app_name=app_name, **es_storage_config)
     63 
     64         return {

/cache/.bk/env/lib/python3.6/contextlib.py in inner(*args, **kwds)
     50         def inner(*args, **kwds):
     51             with self._recreate_cm():
---> 52                 return func(*args, **kwds)
     53         return inner
     54 

/data/bkce/bkmonitorv3/monitor/apm/models/datasource.py in apply_datasource(cls, bk_biz_id, app_name, **option)
    132             obj = cls.objects.create(bk_biz_id=bk_biz_id, app_name=app_name)
    133         # 创建data_id
--> 134         obj.create_data_id()
    135         # 创建结果表
    136         obj.create_or_update_result_table(**option)

/data/bkce/bkmonitorv3/monitor/apm/models/datasource.py in create_data_id(self)
    113                 **data_link_param,
    114             }
--> 115             data_id_info = resource.metadata.create_data_id(param)
    116         bk_data_id = data_id_info["bk_data_id"]
    117         self.bk_data_id = bk_data_id

/data/bkce/bkmonitorv3/monitor/core/drf_resource/base.py in __call__(self, *args, **kwargs)
     96         from core.drf_resource.models import ResourceData
     97 
---> 98         return ResourceData.objects.request(tmp_resource, args, kwargs)
     99 
    100     @property

/data/bkce/bkmonitorv3/monitor/core/drf_resource/models.py in request(self, resource, args, kwargs)
     51         # 如果没有配置则退出
     52         if not (getattr(settings, "ENABLE_RESOURCE_DATA_COLLECT", False) and resource.support_data_collect):
---> 53             return resource.request(*args, **kwargs)
     54 
     55         resource_name = "{}.{}".format(resource.__class__.__module__, resource.__class__.__name__)

/data/bkce/bkmonitorv3/monitor/core/drf_resource/base.py in request(self, request_data, **kwargs)
    220             request_data = request_data or kwargs
    221             validated_request_data = self.validate_request_data(request_data)
--> 222             response_data = self.perform_request(validated_request_data)
    223             validated_response_data = self.validate_response_data(response_data)
    224             return validated_response_data

/data/bkce/bkmonitorv3/monitor/metadata/resources/resources.py in perform_request(self, validated_request_data)
     93                 raise ValueError(_("空间唯一标识{}错误").format(space_uid))
     94 
---> 95         request = get_request()
     96         bk_app_code = get_app_code_by_request(request)
     97         # 当请求的 app_code 为空时,记录请求,用于后续优化处理

/data/bkce/bkmonitorv3/monitor/bkmonitor/utils/request.py in get_request(peaceful)
     25         return None
     26 
---> 27     raise Exception("get_request: current thread hasn't request.")
     28 
     29 

Exception: get_request: current thread hasn't request.

In [3]: 
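
A note on the final exception: the traceback shows it being raised by `get_request`, called from the metadata resource while `create_data_id` is handled. A plain shell session has no HTTP request attached to the current thread, so the call fails before application creation is really exercised, and this run by itself says little about the timeout. The Python sketch below shows the general thread-local pattern that produces this kind of error. It is an illustration only, not bk-monitor's code: `_local` and `set_request` are hypothetical names, and the behaviour of `peaceful` and its default value are assumptions; only the `get_request(peaceful)` signature and the error text come from the traceback.

```python
import threading

_local = threading.local()  # hypothetical per-thread storage


def set_request(request):
    """Bind a request object to the current thread (e.g. in middleware)."""
    _local.request = request


def get_request(peaceful=False):
    """Return the request bound to this thread.

    Assumed behaviour: with peaceful=True a missing request yields None
    instead of an error.
    """
    request = getattr(_local, "request", None)
    if request is not None:
        return request
    if peaceful:
        return None
    raise Exception("get_request: current thread hasn't request.")
```

Under this pattern, code that runs inside a web request finds the object that middleware bound to the thread, while the same code run from a shell finds nothing and raises.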
