Internal Server Error due to attempt record with start_time/end_time both unset #14768

Open
jmarshall opened this issue Dec 16, 2024 · 0 comments
jmarshall commented Dec 16, 2024

What happened?

After we manually kill a job (via kill -TERM on the worker VM or similar), clicking the job ID on the batch UI page results in a 500 Internal Server Error instead of opening that job's page.

The server logs (see below) indicate that the cause is a record in the attempts database table with both start_time and end_time set to NULL. Looking in the database shows that this record belongs to the next, yet-to-be-started attempt, not the attempt that was just killed (which, in our observations, has had at least end_time filled in).

This could be addressed by ensuring that at least one of these fields is always non-NULL, or more likely by making the attempts.sort(…) invocation more robust, e.g., via

attempts.sort(key=lambda x: x['start_time'] or x['end_time'] or MAXINT)

(where MAXINT is a suitable value to make these entries sort last)
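
To make the suggestion concrete, here is a minimal, self-contained sketch of the failure and the proposed fix. The attempt records, their `attempt_id` values and timestamps, and the helper name `attempt_sort_key` are hypothetical, made up for illustration. `sys.maxsize` stands in for MAXINT as one possible choice, not necessarily the value the codebase should use.

```python
import sys

# Hypothetical attempt records; in the real service these rows come from
# the attempts table. The last one mimics a not-yet-started attempt with
# both timestamps unset.
attempts = [
    {'attempt_id': 'b', 'start_time': None, 'end_time': 1733352600000},
    {'attempt_id': 'c', 'start_time': None, 'end_time': None},
    {'attempt_id': 'a', 'start_time': 1733352000000, 'end_time': 1733352500000},
]


def attempt_sort_key(attempt):
    # Fall back to end_time, then to a large sentinel (assumed stand-in
    # for MAXINT) so records with no timestamps sort last.
    return attempt['start_time'] or attempt['end_time'] or sys.maxsize


# The key currently used in _get_attempts yields None for the unset
# record, and comparing None with an int raises TypeError.
try:
    sorted(attempts, key=lambda x: x['start_time'] or x['end_time'])
    current_key_fails = False
except TypeError:
    current_key_fails = True

# With the sentinel fallback the sort succeeds.
sorted_attempts = sorted(attempts, key=attempt_sort_key)
```

One caveat with the `or` chain: a timestamp of 0 is falsy and would also fall through to the next field, which is harmless for real epoch timestamps but worth knowing.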

Version

0.2.133

Relevant log output

{"severity":"ERROR","levelname":"ERROR","asctime":"2024-12-04 22:51:26,136","filename":"web_protocol.py","funcNameAndLine":"log_exception:421","message":"Error handling request","exc_info":"Traceback (most recent call last):
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_protocol.py\", line 452, in _handle_request
    resp = await request_handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_app.py\", line 543, in _handle
    resp = await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_middlewares.py\", line 114, in impl
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/csrf.py\", line 27, in check_csrf_token
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/batch/utils.py\", line 19, in unavailable_if_frozen
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/metrics.py\", line 28, in monitor_endpoints_middleware
    response = await prom_async_time(REQUEST_TIME.labels(endpoint=endpoint, verb=verb), handler(request))  # type: ignore
  File \"/usr/local/lib/python3.9/dist-packages/prometheus_async/aio/_decorators.py\", line 55, in measure
    rv = await future
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp_session/__init__.py\", line 199, in factory
    response = await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/auth.py\", line 68, in wrapped
    return await fun(request, userdata)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 202, in wrapped
    return await fun(request, userdata, batch_id)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 163, in wrapped
    return await fun(request, userdata, *args, **kwargs)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 2940, in ui_get_job
    job, attempts, job_log_bytes, resource_usage = await asyncio.gather(
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 2640, in _get_attempts
    attempts.sort(key=lambda x: x['start_time'] or x['end_time'])
TypeError: '<' not supported between instances of 'NoneType' and 'int'","hail_log":1}
@jmarshall added the needs-triage label Dec 16, 2024
@patrick-schultz removed the needs-triage label Jan 6, 2025