RFD 0193 - Stable UNIX user UIDs #50414

Merged
merged 11 commits into from
Jan 17, 2025
76 changes: 76 additions & 0 deletions rfd/0193-stable-unix-user.md
@@ -0,0 +1,76 @@
---
authors: Edoardo Spadolini ([email protected])
state: draft
---

# RFD 0193 - Stable UNIX user UIDs

## Required Approvers

* Engineering: someone from scale, someone from server access

## What

Add a way for the control plane to generate and store "stable" UIDs to be used for automatically provisioned UNIX users across all Teleport SSH servers.

## Why

To support interoperability with tools that rely on UIDs to identify users across different machines running the Teleport SSH server.

## Goal

After the appropriate setting is enabled in the control plane, all compliant (i.e. up to date) Teleport SSH nodes will query the control plane to know which UID to use when attempting to provision a host user. This applies when a Teleport user logs into the machine over SSH with a username that doesn't currently exist as a host user on the machine, with a roleset that allows for host user creation in "keep" mode for the specified machine and username, and with no `host_user_uid` trait. Using the returned UID will be a requirement for both the user and for its primary group: if another host user has the same UID, or a group has the same GID - and thus user creation will fail - the login will also fail. If the host user already exists on the machine, the login will proceed with the existing user, exactly as it does today. The `host_user_uid` trait, if set, will take priority over the stable UID.
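
The conditions in the paragraph above can be read as a small decision function. The following is a minimal sketch under that reading; every type and function name in it is hypothetical and none of them are Teleport identifiers. Cases the RFD does not describe are lumped into a single "current behavior" outcome that the sketch does not model.

```go
package main

// ProvisionInput gathers the facts the Goal section conditions on.
// All names are hypothetical; they are not Teleport identifiers.
type ProvisionInput struct {
	FeatureEnabled   bool   // stable_unix_user_config.enabled in the control plane
	HostUserExists   bool   // the username already exists on the machine
	KeepModeAllowed  bool   // roleset allows host user creation in "keep" mode
	HostUserUIDTrait string // value of the host_user_uid trait, "" if unset
}

// UIDSource says where the UID for the login comes from.
type UIDSource int

const (
	UseExistingUser    UIDSource = iota // log in as the existing host user
	UseTraitUID                         // host_user_uid takes priority over the stable UID
	UseStableUID                        // query the control plane for a stable UID
	UseCurrentBehavior                  // anything else: unchanged, not modeled here
)

// ChooseUIDSource applies the checks in the order the RFD implies:
// an existing user wins, then the trait, then the stable UID.
func ChooseUIDSource(in ProvisionInput) UIDSource {
	switch {
	case in.HostUserExists:
		return UseExistingUser
	case in.HostUserUIDTrait != "":
		return UseTraitUID
	case in.FeatureEnabled && in.KeepModeAllowed:
		return UseStableUID
	default:
		return UseCurrentBehavior
	}
}
```

For example, `ChooseUIDSource(ProvisionInput{FeatureEnabled: true, KeepModeAllowed: true})` yields `UseStableUID`, while setting `HostUserUIDTrait` on the same input yields `UseTraitUID`.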
Contributor

if another host user has the same UID, or a group has the same GID - and thus user creation will fail - the login will also fail

What is the means to resolving this state? Should there be an event or notification emitted to alert admins that manual intervention is needed?

Contributor Author

I'm honestly not sure of the extent of how detailed we can make our errors through SSH, but surely the user that can't log in will ask someone for help?

Contributor

At minimum we should make sure that the ssh service logs are very clear about the problem to reduce support load

Contributor

I think it's probably out of scope for this RFD, but if we had a reasonable way to surface non-fatal errors after the session is established it would probably be better to inform them that they're running an unexpected UID for their login without breaking the connection. Maybe that's something we could revisit down the line


## UX

The `cluster_auth_preference` configuration singleton will grow a new field `spec.stable_unix_user_config`:

```yaml
kind: cluster_auth_preference
version: v2
metadata:
  labels:
    teleport.dev/origin: dynamic
  name: cluster-auth-preference
  revision: 8ac8cd36-7f80-452b-8b77-147a5588f25f
spec:
  # ...
  stable_unix_user_config:
    enabled: true
    first_uid: 7000001
    last_uid: 7019999
```

Teleport SSH servers will check the `enabled` field to know if the feature is enabled, and - if so - they will query the auth server for the UID to use through a new rpc whenever they need to provision a new host user in "keep" mode with no otherwise defined UID. In the initial implementation, provisioned host groups other than the primary group will be generated according to the default system behavior.
Contributor

Will this information be dictated to the agent by the control plane once the PDP work is implemented?

Contributor Author

The stapled access decision info coming from the control plane to the SSH server is still very much to be determined, but we could pass along a stable UID given some conditions (the user having provisioning enabled on the server, a single username allowed to log in, the feature being enabled and the unix username already having an associated stable UID), and let the server use the RPC defined here to fetch a stable UID if it's missing from the stapled info. In general we shouldn't preemptively define a stable UID for every username, and the control plane can't know if the server already has the username registered in the local user directory.


## Internals

The Teleport SSH agent will use the new `teleport.userprovisioning.v2.StableUNIXUsersService/ObtainUIDForUsername` rpc to query the UID for any new username; the auth server will generate and persist a UID if one hasn't been set for the given username, or read the already persisted one, and then it will return it to the agent.

```proto
service StableUNIXUsersService {
  rpc ObtainUIDForUsername(ObtainUIDForUsernameRequest) returns (ObtainUIDForUsernameResponse) {
    option idempotency_level = IDEMPOTENT;
  }
}

message ObtainUIDForUsernameRequest {
  string username = 1;
}

message ObtainUIDForUsernameResponse {
  int32 uid = 2;
}
```

In the initial implementation there will be no way to read or manage the list of users and no tunables other than the range of UIDs available for use and the `enabled` flag. It is not planned to ever allow for deleting a UID after it's been assigned, but in the future we could add the ability to clear the assigned username for a given UID while still leaving it occupied.

The cluster state storage will contain a bidirectional mapping of usernames and UIDs, consisting of two items per username, one keyed by username at `/stable_unix_user/by_username/<hex username>` containing the UID and one keyed by UID at `/stable_unix_user/by_uid/<uid as 8 hex digits>` containing the username (to allow for ranged queries in numerical order), as well as a "hint" for the next available UID at `/stable_unix_user/next_uid_hint`.
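
As an illustration of the key layout described above, here is a minimal sketch of how the two per-username keys could be built. The helper names are hypothetical, and the sketch assumes the username is hex-encoded byte by byte and that UIDs are non-negative. Fixed-width, zero-padded hex is what makes lexicographic key order match numerical UID order.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// keyPrefix and the "next_uid_hint" suffix come from the RFD text;
// the helper names below are hypothetical.
const keyPrefix = "/stable_unix_user"

// byUsernameKey hex-encodes the username so that arbitrary bytes in it
// cannot interfere with the key separators.
func byUsernameKey(username string) string {
	return keyPrefix + "/by_username/" + hex.EncodeToString([]byte(username))
}

// byUIDKey renders the UID as exactly 8 lowercase hex digits, so that
// sorting keys as strings sorts UIDs numerically (assuming uid >= 0).
func byUIDKey(uid int32) string {
	return fmt.Sprintf("%s/by_uid/%08x", keyPrefix, uint32(uid))
}

// nextUIDHintKey is the single key holding the allocation hint.
func nextUIDHintKey() string {
	return keyPrefix + "/next_uid_hint"
}
```

With the example range from the UX section, `byUIDKey(7000001)` is `/stable_unix_user/by_uid/006acfc1` and `byUsernameKey("alice")` is `/stable_unix_user/by_username/616c696365`.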

If reading `by_username/<username>` succeeds, the UID stored in it will be returned (even if outside of the currently defined UID range); otherwise, if the `next_uid_hint` is in the UID range, the auth server will attempt to atomically create `by_username/<username>` and `by_uid/<uid>`, update `next_uid_hint`, and assert that `cluster_auth_preference` hasn't changed.
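
The fast path in the paragraph above can be sketched against an in-memory store. This is only an illustration: the `store` type and its fields are hypothetical, a mutex stands in for the backend's atomic conditional write, the `cluster_auth_preference` revision check is not modeled, and advancing the hint to `uid + 1` is an assumption, since the RFD only says the hint is updated.

```go
package main

import (
	"errors"
	"sync"
)

// errNeedScan signals that the fast path cannot allocate and the caller
// should fall back to the slower retry-and-scan logic.
var errNeedScan = errors.New("next_uid_hint unusable, fall back to a scan")

// store is a hypothetical in-memory stand-in for the cluster state backend.
type store struct {
	mu         sync.Mutex
	byUsername map[string]int32
	byUID      map[int32]string
	nextHint   int32
	hasHint    bool
}

// obtainUID returns the persisted UID for username, or allocates the UID
// at next_uid_hint when the hint lies in [first, last] and is unassigned.
func (s *store) obtainUID(username string, first, last int32) (int32, error) {
	s.mu.Lock() // stands in for the atomic conditional write
	defer s.mu.Unlock()

	// An existing mapping is returned even if it is outside the range.
	if uid, ok := s.byUsername[username]; ok {
		return uid, nil
	}
	if !s.hasHint || s.nextHint < first || s.nextHint > last {
		return 0, errNeedScan
	}
	uid := s.nextHint
	if _, taken := s.byUID[uid]; taken {
		return 0, errNeedScan
	}
	s.byUsername[username] = uid
	s.byUID[uid] = username
	s.nextHint = uid + 1 // assumption: the hint simply advances by one
	return uid, nil
}
```

Calling `obtainUID` twice with the same username returns the same UID, which is the idempotency the rpc definition declares.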

If the operation succeeds, the UID is returned; otherwise the operation is tried again after some checks. If the UID for the username is still missing, the CAP (after a hard read from the backend) hasn't changed, and the `next_uid_hint` hasn't changed (which we have to check separately, because a conditional failure of an atomic operation unfortunately doesn't return details about which check has failed), it means that we might have run out of contiguous space in the UID range. If this is the case, or if `next_uid_hint` is missing, or something else has gone wrong, the auth server will scan the `by_uid` key range from the `first_uid` to the `last_uid` to find the first available unassigned UID; if `next_uid_hint` is present and in range, the search can start with `next_uid_hint` and then wrap around the end of the range to the beginning. After finding a usable UID, the auth server will then proceed to try the atomic write of `by_username/<username>` and `by_uid/<uid>` again, still asserting the revision of `cluster_auth_preference` and either creating, updating, deleting, or asserting the nonexistence of `next_uid_hint`, depending on whether or not the scan has revealed a second unallocated UID. This somewhat abstruse logic is needed to sanely handle changes in the valid UID range while UIDs have been and are actively being persisted.
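
The wrap-around scan described above can be sketched as follows. This is a simplified illustration with a hypothetical function name: a map of assigned UIDs stands in for the ranged read over `by_uid`, and the sketch only finds the first free UID, leaving out the search for a second one that decides what happens to `next_uid_hint`.

```go
package main

// findFreeUID returns the first unassigned UID in [first, last]. The scan
// starts at hint when hint lies inside the range, otherwise at first, and
// wraps around the end of the range back to first. The used map is a
// stand-in for a ranged read of the by_uid keys.
func findFreeUID(first, last, hint int32, used map[int32]bool) (int32, bool) {
	if first > last {
		return 0, false
	}
	start := first
	if hint >= first && hint <= last {
		start = hint
	}
	// Use int64 for the size so that very wide ranges cannot overflow.
	size := int64(last) - int64(first) + 1
	offset := int64(start) - int64(first)
	for i := int64(0); i < size; i++ {
		uid := first + int32((offset+i)%size)
		if !used[uid] {
			return uid, true
		}
	}
	return 0, false // the whole range is assigned
}
```

For a range of 10 to 14 with 12, 13 and 14 assigned and a hint of 12, the scan wraps and returns 10.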

Seeing as we are not going to change UID allocations, the auth server will use a simple LRU or time-based cache for the very happy path (in which the username already has an assigned UID), to avoid bursts of backend reads as a result of a new username logging into several different servers for the first time in quick succession (with a `tsh` multi-host command or with something like Ansible). Since there's no expectation that any given value will be heavily read (since, in theory, each host will read each user at most once) and every write will require hard reads and atomic operations, we don't expect to need to replicate the data in the auth cache or to support watching, at least in the first implementation.
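
A time-based cache for the happy path could look like the minimal sketch below. The type, its methods and the `defaultTTL` value are all hypothetical and not taken from the RFD. Because assigned UIDs never change, a cached entry can never be wrong, so the TTL here only bounds how long entries stay in memory.

```go
package main

import (
	"sync"
	"time"
)

// defaultTTL is an arbitrary illustrative value, not taken from the RFD.
const defaultTTL = time.Minute

type cacheEntry struct {
	uid     int32
	expires time.Time
}

// uidCache is a hypothetical time-based cache of username to UID.
type uidCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

func newUIDCache(ttl time.Duration) *uidCache {
	return &uidCache{ttl: ttl, entries: make(map[string]cacheEntry)}
}

// get returns the cached UID, dropping the entry if it has expired.
func (c *uidCache) get(username string) (int32, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[username]
	if !ok {
		return 0, false
	}
	if !time.Now().Before(e.expires) {
		delete(c.entries, username)
		return 0, false
	}
	return e.uid, true
}

// put records a UID that has already been persisted in the backend.
func (c *uidCache) put(username string, uid int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[username] = cacheEntry{uid: uid, expires: time.Now().Add(c.ttl)}
}
```

In this sketch the auth server would call `get` before touching the backend and `put` after a successful read or allocation, so a burst of first-time logins for one username costs a single backend round trip per TTL window.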

The `ObtainUIDForUsername` rpc will require `create`+`read` permissions on the `stable_unix_user` kind, which will be granted to the `Node` builtin role. Future RPCs might include point and range reads for the mapping, which should require `read` and `read`+`list` permissions respectively, which can be assigned to a user or bot for inspection and automation.