Non-admin users lose access after some time (around 24h) #3607

Closed
via-justa opened this issue Jan 2, 2024 · 8 comments
Labels: bug, part:auth/teams

Comments

@via-justa

What went wrong?

What happened:
About 24 hours after the service starts, non-admin users get the following error message:

User with Admin permissions in your organization must sign on and setup OnCall before it can be used


Admins see a pop-up:

Service account token deleted

followed by

Plugin settings updated

and can then use the tool without any issue.

The logs on Grafana:

logger=authn.service t=2024-01-02T12:15:54.035302975Z level=info msg="Failed to authenticate request" client=auth.client.api-key error="[api-key.invalid] API key is invalid"
logger=context userId=0 orgId=0 uname= t=2024-01-02T12:15:54.035473904Z level=info msg= error="[api-key.invalid] API key is invalid" remote_addr=xx.xx.xx.xx traceID=
logger=context userId=0 orgId=0 uname= t=2024-01-02T12:15:54.035554122Z level=info msg="Request Completed" method=HEAD path=/api/access-control/users/permissions/search status=401 remote_addr=xx.xx.xx.xx time_ms=0 duration=290.018µs size=0 referer= handler=notfound
logger=authn.service t=2024-01-02T12:15:54.155269673Z level=info msg="Failed to authenticate request" client=auth.client.api-key error="[api-key.invalid] API key is invalid"
logger=context userId=0 orgId=0 uname= t=2024-01-02T12:15:54.155344844Z level=info msg= error="[api-key.invalid] API key is invalid" remote_addr=xx.xx.xx.xx traceID=
logger=context userId=0 orgId=0 uname= t=2024-01-02T12:15:54.155797243Z level=info msg="Request Completed" method=HEAD path=/api/org status=401 remote_addr=xx.xx.xx.xx time_ms=0 duration=598.942µs size=0 referer= handler=/api/org/
logger=context userId=60 orgId=1 uname=sa-sa-autogen-oncall t=2024-01-02T12:15:54.568190049Z level=info msg="Request Completed" method=HEAD path=/api/access-control/users/permissions/search status=404 remote_addr=xx.xx.xx.xx time_ms=20 duration=20.892124ms size=0 referer= handler=notfound
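
To pick the failing requests out of a longer Grafana log, the lines above can be filtered by status code. This is a minimal sketch that assumes the log stays in the key=value format shown here; `parse_logfmt` and `failed_requests` are hypothetical helper names, not part of Grafana or OnCall.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields[key] = value
    return fields


def failed_requests(lines):
    """Yield (status, method, path, uname) for completed requests with status >= 400."""
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") != "Request Completed":
            continue
        status = int(fields.get("status", "0"))
        if status >= 400:
            yield status, fields["method"], fields["path"], fields.get("uname", "")
```

Feeding it the Grafana log separates the anonymous 401 responses (empty `uname`) from the 404 returned to `sa-sa-autogen-oncall`.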

The logs on the OnCall engine:

2024-01-02 12:15:54 source=engine:app google_trace_id=none logger=apps.grafana_plugin.helpers.client Error connecting to api instance 404 Client Error: Not Found for url: https://grafana.xxx.com/api/access-control/users/permissions/search?actionPrefix=grafana-oncall-app
2024-01-02 12:15:54 source=engine:app google_trace_id=none logger=root outbound latency=0.1165516022592783 status=404 method=HEAD url=https://grafana.xxx.com/api/access-control/users/permissions/search?actionPrefix=grafana-oncall-app slow=0 
2024-01-02 12:15:54 source=engine:app google_trace_id=none logger=root outbound latency=0.1029507745988667 status=200 method=HEAD url=https://grafana.xxx.com/api/org slow=0 
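
One way to check whether a given service account token is still accepted is to replay the two HEAD requests from the engine log by hand. The sketch below uses only the Python standard library; the paths are copied from the log above, while `build_probe`, `probe`, the base URL and the token value are placeholders chosen for illustration.

```python
import urllib.error
import urllib.request

# Paths copied from the engine log; nothing else about the API is assumed.
PROBE_PATHS = (
    "/api/org",
    "/api/access-control/users/permissions/search?actionPrefix=grafana-oncall-app",
)


def build_probe(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Build a HEAD request that authenticates with a bearer token."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method="HEAD",
        headers={"Authorization": f"Bearer {token}"},
    )


def probe(base_url: str, token: str) -> dict:
    """Return the HTTP status per probe path. Needs network access to Grafana."""
    results = {}
    for path in PROBE_PATHS:
        request = build_probe(base_url, path, token)
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                results[path] = response.status
        except urllib.error.HTTPError as err:
            results[path] = err.code  # error statuses such as 401 land here
    return results
```

Calling `probe("https://grafana.domain.com/", token)` with the token the plugin currently holds and getting 401 on both paths would match the `api-key.invalid` lines in the Grafana log.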

What did you expect to happen:

  • non-admin users can access Grafana OnCall without errors

Maybe related issues

#3236
#3566

Our setup

Grafana 10.2.2 deployed with Grafana Helm chart
Grafana OnCall 1.3.77 deployed with the helm chart from this repo
GCP CloudSQL DB for both Grafana and OnCall
GCP MemoryStore Broker

Cloudflare ZeroTrust protects OnCall and Grafana.
We use Cloudflare as an auth proxy for Grafana, but the error is the same with local users.

Config
base_url: "oncall.domain.com"
base_url_protocol: https

engine:
  replicaCount: 1

celery:
  replicaCount: 1

migrate:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded

ingress:
  enabled: false

ingress-nginx:
  enabled: false

cert-manager:
  enabled: false

service:
  enabled: false
  type: NodePort
  port: 8080
  nodePort: 30001

grafana:
  enabled: false

externalGrafana:
  url: "https://grafana.domain.com/"

database:
  # can be either mysql or postgresql
  type: postgresql

mariadb:
  enabled: false

externalPostgresql:
  host: "xxxxxxxxxxx"
  port: "5432"
  db_name: "xxxxxxxxxxx"
  user: "xxxxxxxxxxx"
  password: "xxxxxxxxxxx"

broker:
  type: "redis"

rabbitmq:
  enabled: false

redis:
  enabled: false

# workaround for passwordless auth as described here: https://github.com/grafana/oncall/issues/2941
externalRedis:
  host: "place_holder"
  password: "place_holder"

env:
  REDIS_URI: "redis://xxxxxxxxxxx:6379/1"
  GRAFANA_CLOUD_ONCALL_API_URL: "https://oncall-prod-eu-west-0.grafana.net/oncall"

oncall:
  exporter:
    enabled: true
  slack:
    # Enable the Slack ChatOps integration for the OnCall Engine.
    enabled: true
    clientId: xxxxxxxxxxx
    clientSecret: xxxxxxxxxxx
    signingSecret: xxxxxxxxxxx
    redirectHost: "https://grafana.domain.com/"

How do we reproduce it?

  1. Install with the provided configuration
  2. Create a local non-admin user
  3. Wait more than 24 hours
  4. Log in with the non-admin user and open OnCall

Grafana OnCall Version

1.3.77

Product Area

Auth

Grafana OnCall Platform?

Kubernetes

User's Browser?

Any

Anything else to add?

No response

@via-justa added the bug label Jan 2, 2024
@via-justa
Author

Switching externalGrafana to the internal cluster service and connecting the plugin again via the cluster's internal address did not change anything. I'm really out of ideas on how to debug the issue.

@dmitry-tiger

When I investigated this, I found that some "old" tokens are somehow cached in the request object. When a new token is generated and another user then makes a status request, the token check fails and the token's status in the database is changed to failed.
I also noticed that it happens every time I do a redeployment that runs job-migrate, but that is not the only thing that causes this issue.
As a workaround I restart the engine container, which works in most cases.

@via-justa
Author

@dmitry-tiger, are you restarting the engine every 24 hours?

@via-justa
Author

@grafana/grafana-oncall-frontend
Any ideas or directions for investigation?

@dmitry-tiger

@dmitry-tiger, are you restarting the engine every 24 hours?

I can't say that it happens every day in our environment, but at times it happens quite often (2-3 times a week).
Do you have any routines scheduled to run every day, perhaps tasks like job-migrate?

@via-justa
Author

@dmitry-tiger, no jobs on my side. Only the built-in functionality.
It's a new installation for a planned migration from OpsGenie.

@via-justa
Author

I enabled debug logging on Grafana and saw the following message when a user opened the OnCall panel.

 logger=accesscontrol t=2024-01-17T12:36:51.195411895Z level=debug msg="No permissions set for entity" namespace=user id=65 orgID=0 [email protected]

Our org ID is 1, and in the DB the user has org ID 1. Why is the request being sent with org ID 0?

I'm not the only one having this issue. Can one of the developers please jump in and assist?

@mderynck
Contributor

mderynck commented Feb 9, 2024

This was resolved by keeping mirageSecretKey constant through an external secret.
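
Why a changing key would produce this symptom can be sketched in a few lines. The example assumes, which this thread does not confirm, that the engine stores the Grafana token encrypted under a key derived from mirageSecretKey. The cipher is a deliberately simple standard-library stand-in, not the one OnCall uses, and `seal`, `unseal` and `new_key` are hypothetical names.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def new_key() -> str:
    """Generate a random key once; store it externally so it survives redeploys."""
    return secrets.token_urlsafe(32)


def _keystream(key: str, length: int) -> bytes:
    """Derive a repeatable byte stream from the key (toy construction only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(f"{key}:{counter}".encode()).digest()
        counter += 1
    return out[:length]


def seal(key: str, token: str) -> bytes:
    """Encrypt a token for storage: 32-byte HMAC tag followed by the XORed body."""
    data = token.encode()
    body = bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))
    tag = hmac.new(key.encode(), body, hashlib.sha256).digest()
    return tag + body


def unseal(key: str, blob: bytes) -> Optional[str]:
    """Return the stored token, or None if the blob was sealed with another key."""
    tag, body = blob[:32], blob[32:]
    expected = hmac.new(key.encode(), body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    return bytes(a ^ b for a, b in zip(body, _keystream(key, len(body)))).decode()
```

With one constant key, `unseal(key, seal(key, token))` returns the token on every run. If each deployment calls `new_key()` again, the value already in the database can no longer be read, which would be consistent with the plugin presenting an invalid API key until an admin reconnects it.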

3 participants