Internal Server Error due to attempt record with start_time/end_time both unset #14768

jmarshall · 2024-12-16T20:51:38Z

What happened?

When a job has recently been killed manually (via kill -TERM on the worker VM or similar), clicking the job ID on the batch UI page to open that job's page results in a 500 Internal Server Error instead.

The server logs (see below) indicate that the problem is a record in the attempts database table with both start_time and end_time set to NULL. Looking in the database shows that the record in question is the one for the next, yet-to-be-started attempt, not the attempt that has just been killed (which, in our observations, has had at least end_time filled in).

This could be addressed by ensuring that at least one of these fields is always non-NULL, or more likely by making the attempts.sort(…) invocation more robust, e.g., via

attempts.sort(key=lambda x: x['start_time'] or x['end_time'] or MAXINT)

(where MAXINT is a suitable value to make these entries sort last)
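
As a rough sketch of the more robust sort suggested above (not the actual Hail Batch code): the key below orders attempts by start_time, falls back to end_time, and places attempts with neither timestamp last. It uses a tuple key instead of a MAXINT sentinel. The helper name attempt_sort_key, the attempt_id field, and the sample records are hypothetical, and timestamps are assumed to be integers, as the error message in the traceback suggests.

```python
def attempt_sort_key(attempt):
    """Sort key: (0, timestamp) for attempts with a time, (1, 0) otherwise.

    Hypothetical helper, not part of the Hail code base. Tuples compare
    element by element, so every (1, 0) entry sorts after every (0, ...)
    entry and None is never compared against an int.
    """
    start = attempt.get('start_time')
    end = attempt.get('end_time')
    if start is not None:
        return (0, start)
    if end is not None:
        return (0, end)
    return (1, 0)


# Hypothetical sample records mimicking rows from the attempts table.
attempts = [
    {'attempt_id': 'c', 'start_time': None, 'end_time': None},  # not yet started
    {'attempt_id': 'b', 'start_time': None, 'end_time': 2000},  # killed attempt
    {'attempt_id': 'a', 'start_time': 1000, 'end_time': 1500},
]

# The key currently used by the failing line raises TypeError on this data.
try:
    sorted(attempts, key=lambda x: x['start_time'] or x['end_time'])
    original_key_failed = False
except TypeError:
    original_key_failed = True

attempts.sort(key=attempt_sort_key)
sorted_ids = [a['attempt_id'] for a in attempts]
```

Testing with `is not None` rather than `or` also avoids treating a timestamp of 0 as missing, although that case is unlikely in practice.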

Version

0.2.133

Relevant log output

{"severity":"ERROR","levelname":"ERROR","asctime":"2024-12-04 22:51:26,136","filename":"web_protocol.py","funcNameAndLine":"log_exception:421","message":"Error handling request","exc_info":"Traceback (most recent call last):
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_protocol.py\", line 452, in _handle_request
    resp = await request_handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_app.py\", line 543, in _handle
    resp = await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_middlewares.py\", line 114, in impl
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/csrf.py\", line 27, in check_csrf_token
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/batch/utils.py\", line 19, in unavailable_if_frozen
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/metrics.py\", line 28, in monitor_endpoints_middleware
    response = await prom_async_time(REQUEST_TIME.labels(endpoint=endpoint, verb=verb), handler(request))  # type: ignore
  File \"/usr/local/lib/python3.9/dist-packages/prometheus_async/aio/_decorators.py\", line 55, in measure
    rv = await future
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp_session/__init__.py\", line 199, in factory
    response = await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/auth.py\", line 68, in wrapped
    return await fun(request, userdata)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 202, in wrapped
    return await fun(request, userdata, batch_id)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 163, in wrapped
    return await fun(request, userdata, *args, **kwargs)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 2940, in ui_get_job
    job, attempts, job_log_bytes, resource_usage = await asyncio.gather(
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 2640, in _get_attempts
    attempts.sort(key=lambda x: x['start_time'] or x['end_time'])
TypeError: '<' not supported between instances of 'NoneType' and 'int'","hail_log":1}

jmarshall added the needs-triage label Dec 16, 2024

patrick-schultz assigned cjllanwarne Jan 6, 2025

patrick-schultz removed the needs-triage label Jan 6, 2025
