initial cockroachdb-operator charm for review #1

Draft. Wants to merge 5 commits into base: master. Showing changes from 4 commits.

2 changes: 2 additions & 0 deletions .flake8
@@ -0,0 +1,2 @@
[flake8]
max-line-length = 99
9 changes: 9 additions & 0 deletions .gitmodules
@@ -0,0 +1,9 @@
[submodule "mod/jinja"]
path = mod/jinja
url = https://github.com/pallets/jinja
[submodule "mod/ops-interface-proxy-listen-tcp"]
path = mod/ops-interface-proxy-listen-tcp
url = https://github.com/dshcherb/ops-interface-proxy-listen-tcp
[submodule "mod/operator"]
path = mod/operator
url = https://github.com/canonical/operator
58 changes: 58 additions & 0 deletions README.md
@@ -0,0 +1,58 @@
CockroachDB Charm
==================================

# Overview

This charm provides a means to deploy and operate CockroachDB - a scalable, cloud-native SQL database with built-in clustering support.

# Deployment Requirements

The charm requires Juju 2.7.5 or later (see [LP: #1865229](https://bugs.launchpad.net/juju/+bug/1865229)).

# Deployment

To deploy CockroachDB in single-node mode, explicitly set both replication factors to 1.

```bash
juju deploy <charm-src-dir> --config default-zone-replicas=1 --config system-data-replicas=1
```

Unless replication factors are specified explicitly, CockroachDB uses a replication factor of 3, so deploy three units:

```bash
juju deploy <charm-src-dir>
juju add-unit cockroachdb -n 2
```

For an HA deployment with an explicit number of replicas:

```bash
juju deploy <charm-src-dir> --config default-zone-replicas=3 --config system-data-replicas=3 -n 3
```

# Accessing CLI

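```bash
juju ssh cockroachdb/0
# The charm currently starts the daemon with --insecure, so the client needs it too.
cockroach sql --insecure
```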

# Web UI

The web UI is accessible at `http://<unit-ip>:8080` (the charm currently starts CockroachDB with `--insecure`, so the UI is served over plain HTTP).

# Using HAProxy as a Load Balancer

An app deployed by this charm can be related to [charm-haproxy](https://github.com/dshcherb/charm-haproxy):

```bash
juju deploy <cockroachdb-charm-src-dir> --config default-zone-replicas=3 --config system-data-replicas=3 -n 3
juju deploy <haproxy-charm-src-dir>
juju relate haproxy cockroachdb
```

Currently, the web UI is not exposed through the load balancer; only PostgreSQL wire protocol connections over TCP are.

# Known Issues

The charm uses a workaround for [LP: #1859769](https://bugs.launchpad.net/juju/+bug/1859769) for single-node deployments by saving a cluster ID in a local state before the peer relation becomes available.
13 changes: 13 additions & 0 deletions config.yaml
@@ -0,0 +1,13 @@
options:
  version:
    type: string
    description: CockroachDB version to use.
    default: v19.2.2
  default-zone-replicas:
    type: int
    description: The number of replicas to configure for the default replication zone. Using 0 means that the charm will use cluster defaults (see https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html#replication-zone-variables).
    default: 0
  system-data-replicas:
    type: int
    description: The number of replicas to configure for the internal system data. Using 0 means that the charm will use cluster defaults (see https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html#replication-zone-variables).
    default: 0
1 change: 1 addition & 0 deletions hooks/install
1 change: 1 addition & 0 deletions lib/interface_proxy_listen_tcp.py
1 change: 1 addition & 0 deletions lib/jinja2
1 change: 1 addition & 0 deletions lib/ops
24 changes: 24 additions & 0 deletions metadata.yaml
@@ -0,0 +1,24 @@
name: cockroachdb
summary: CockroachDB Charm
maintainers:
  - github.com/dshcherb
description: CockroachDB Charm
min-juju-version: 2.7.5
tags:
  - database
provides:
  db:
    interface: pgsql
    optional: true
  proxy-listen-tcp:
    interface: proxy-listen-tcp
Review comment:

Indentation in this file is inconsistent.

peers:
  cluster:
    interface: cockroachdb-peer
series:
  - bionic
resources:
  cockroach-linux-amd64:
    type: file
    filename: cockroach.linux-amd64.tgz
    description: An archive with a binary named "cockroach" as downloaded from https://binaries.cockroachdb.com/cockroach-<version>.linux-amd64.tgz
1 change: 1 addition & 0 deletions mod/jinja
Submodule jinja added at bff489
1 change: 1 addition & 0 deletions mod/operator
Submodule operator added at d259e0
1 change: 1 addition & 0 deletions mod/ops-interface-proxy-listen-tcp
256 changes: 256 additions & 0 deletions src/charm.py
@@ -0,0 +1,256 @@
#!/usr/bin/env python3

import os
import subprocess
import socket
import re
import pwd
import sys
sys.path.append('lib') # noqa

from ops.charm import CharmBase, CharmEvents
from ops.framework import EventBase, EventSource, StoredState
from ops.main import main
from ops.model import (
    ActiveStatus,
    BlockedStatus,
    MaintenanceStatus,
    WaitingStatus,
    ModelError,
)
from cluster import CockroachDBCluster
from interface_proxy_listen_tcp import ProxyListenTcpInterfaceProvides
Review comment:

Not a nice file or module name to read/write.

Contributor Author:

Agreed, I rushed a bit with names. I think names like TCPLoadBalancer and TCPServer (or TCPBackend) would be more comprehensible.


from jinja2 import Environment, FileSystemLoader
from datetime import timedelta
from time import sleep


class CockroachStartedEvent(EventBase):
    pass


class ClusterInitializedEvent(EventBase):
    def __init__(self, handle, cluster_id):
        super().__init__(handle)
        self.cluster_id = cluster_id

    def snapshot(self):
        return self.cluster_id

    def restore(self, cluster_id):
        self.cluster_id = cluster_id


class CockroachDBCharmEvents(CharmEvents):
    cockroachdb_started = EventSource(CockroachStartedEvent)
    cluster_initialized = EventSource(ClusterInitializedEvent)
Review comment:

Everything here is from CockroachDB, so either we consistently prefix them with the database name, which doesn't seem so nice since the class itself already gives these names context, or we avoid the term in all of them. My preference is for the latter.

The first one might be database_started, or daemon_started, etc, depending on what it means.

Contributor Author:

Agreed, daemon_started or even daemonized would be better.



class CockroachDBCharm(CharmBase):
    on = CockroachDBCharmEvents()
    state = StoredState()
Review comment:

That should probably be _stored instead.


    COCKROACHDB_SERVICE = 'cockroachdb.service'
    SYSTEMD_SERVICE_FILE = f'/etc/systemd/system/{COCKROACHDB_SERVICE}'
    WORKING_DIRECTORY = '/var/lib/cockroach'
    COCKROACH_INSTALL_DIR = '/usr/local/bin'
    COCKROACH_BINARY_PATH = f'{COCKROACH_INSTALL_DIR}/cockroach'
    COCKROACH_USERNAME = 'cockroach'
    PSQL_PORT = 26257
    HTTP_PORT = 8080

    MAX_RETRIES = 10
    RETRY_TIMEOUT = timedelta(milliseconds=125)
Review comment:

A bit unexpected to see all of these under this class. What's the rationale?

They feel more like properties of the overall charm code than this type in particular. If they indeed are supposed to be changeable somehow depending on this type, though, then they likely shouldn't look like constants. But happy to discuss further looking at established conventions.

Contributor Author:

Just wanted to associate them with a particular type until I needed them elsewhere to avoid having globals. I aim to split out the code that mutates the machine state into a separate component - it might be that the need to pull those attributes out will arise at that point.

Contributor Author:

Let's discuss further if it's a problem.

Otherwise, I moved some "constants" to a separate type:
e6741a9
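
As an illustration of the direction discussed in this thread (the contents of e6741a9 are not shown on this page), one way to keep such values out of the charm class without resorting to loose globals is a small immutable container. This is only a sketch: the class name `CockroachPaths` and its field names are hypothetical, while the values are the ones from the diff above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CockroachPaths:
    """Hypothetical container for install-related values (not the PR's actual type)."""
    service_name: str = 'cockroachdb.service'
    working_directory: str = '/var/lib/cockroach'
    install_dir: str = '/usr/local/bin'

    @property
    def systemd_service_file(self) -> str:
        # Derived values are computed from the fields, so they cannot drift apart.
        return f'/etc/systemd/system/{self.service_name}'

    @property
    def binary_path(self) -> str:
        return f'{self.install_dir}/cockroach'


PATHS = CockroachPaths()
print(PATHS.binary_path)
print(PATHS.systemd_service_file)
```

Because the dataclass is frozen, an accidental assignment such as `PATHS.install_dir = '/opt'` raises an error, which matches the intent of treating these values as constants.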


    def __init__(self, framework, key):
        super().__init__(framework, key)

        self.state.set_default(is_started=False)

        for event in (self.on.install,
                      self.on.start,
                      # self.on.upgrade_charm,
                      self.on.config_changed,
                      self.on.cluster_relation_changed,
                      self.on.cockroachdb_started,
                      self.on.proxy_listen_tcp_relation_joined):
            self.framework.observe(event, self)

        self.cluster = CockroachDBCluster(self, 'cluster')
        self.tcp_load_balancer = ProxyListenTcpInterfaceProvides(self, 'proxy-listen-tcp')
Review comment:

We need a better type name for that one.

Contributor Author:

Agreed in #1 (comment)


    def on_install(self, event):
        try:
            resource_path = self.model.resources.fetch('cockroach-linux-amd64')
        except ModelError:
            resource_path = None

        if resource_path is None:
            ARCHITECTURE = 'amd64'  # hard-coded until it becomes important
Review comment:

Why all caps when this is a local variable?

What if resource_path is not None? The variable will be unset and the logic below will break down, I think?

Contributor Author (@dshcherb, Apr 3, 2020):

> Why all caps when this is a local variable?

Just to indicate that it is meant to be a constant on purpose.

If a resource is not provided at deployment time or later, an attempt to retrieve it raises a ModelError, which results in None being assigned to resource_path (so we can just try to download a binary instead). Otherwise, resource_path is a valid file system path.

            version = self.model.config['version']
            cmd = ('wget -qO- https://binaries.cockroachdb.com/'
                   f'cockroach-{version}.linux-{ARCHITECTURE}.tgz'
                   f' | tar -C {self.COCKROACH_INSTALL_DIR} -xvz --wildcards'
                   ' --strip-components 1 --no-anchored "cockroach*/cockroach"')
            subprocess.check_call(cmd, shell=True)
            os.chown(self.COCKROACH_BINARY_PATH, 0, 0)
        else:
            cmd = ['tar', '-C', self.COCKROACH_INSTALL_DIR, '-xv', '--wildcards',
                   '--strip-components', '1', '--no-anchored', 'cockroach*/cockroach',
                   '-zf', str(resource_path)]
            subprocess.check_call(cmd)

        self._setup_systemd_service()

    @property
    def is_single_node(self):
Review comment:

Stopped that initial review here.

Contributor Author:

New changes:

e6741a9

        """Both replication factors were set to 1 so it's a good guess that an operator wants
        a 1-node deployment."""
        default_zone_rf = self.model.config['default-zone-replicas']
        system_data_rf = self.model.config['system-data-replicas']
        return default_zone_rf == 1 and system_data_rf == 1

    def _setup_systemd_service(self):
        if self.is_single_node:
            # start-single-node will set replication factors for all zones to 1.
            # Both fragments must be f-strings, otherwise the address is not interpolated.
            exec_start_line = (f'ExecStart={self.COCKROACH_BINARY_PATH} start-single-node'
                               f' --advertise-addr {self.cluster.advertise_addr} --insecure')
        else:
            peer_addresses = [self.cluster.advertise_addr]
            if self.cluster.is_joined:
                peer_addresses.extend(self.cluster.peer_addresses)
            join_addresses = ','.join([str(a) for a in peer_addresses])
            # --insecure until the charm gets CA setup support figured out.
            exec_start_line = (f'ExecStart={self.COCKROACH_BINARY_PATH} start --insecure '
                               f'--advertise-addr={self.cluster.advertise_addr} '
                               f'--join={join_addresses}')
        ctxt = {
            'working_directory': self.WORKING_DIRECTORY,
            'exec_start_line': exec_start_line,
        }
        env = Environment(loader=FileSystemLoader('templates'))
        template = env.get_template('cockroachdb.service')
        rendered_content = template.render(ctxt)

        # NOTE: hash() of a str is salted per interpreter process unless PYTHONHASHSEED is
        # set, so this value is not stable across hook invocations; a hashlib digest is.
        content_hash = hash(rendered_content)
        # TODO: read the rendered file instead to account for any manual edits.
        old_hash = getattr(self.state, 'rendered_content_hash', None)

        if old_hash is None or old_hash != content_hash:
            self.state.rendered_content_hash = content_hash
            with open(self.SYSTEMD_SERVICE_FILE, 'wb') as f:
                f.write(rendered_content.encode('utf-8'))
            subprocess.check_call(['systemctl', 'daemon-reload'])

            try:
                pwd.getpwnam(self.COCKROACH_USERNAME)
            except KeyError:
                subprocess.check_call(['useradd',
                                       '-m',
                                       '--home-dir',
                                       self.WORKING_DIRECTORY,
                                       '--shell',
                                       '/usr/sbin/nologin',
                                       self.COCKROACH_USERNAME])

            if self.state.is_started:
                subprocess.check_call(['systemctl', 'restart', self.COCKROACHDB_SERVICE])
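
Regarding the TODO in `_setup_systemd_service` about reading the rendered file: Python salts `str` hashes per process by default, so a value from the built-in `hash()` stored in one hook run will generally not match the one computed in the next. A possible alternative, sketched here with hypothetical helper names (`content_digest`, `needs_rewrite`), is to compare a `hashlib` digest of the file on disk against the freshly rendered content, which also picks up manual edits.

```python
import hashlib
from pathlib import Path


def content_digest(text: str) -> str:
    """Digest that is identical in every interpreter process, unlike hash()."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


def needs_rewrite(path: Path, rendered: str) -> bool:
    """Return True when the file is missing or differs from the rendered content."""
    try:
        current = path.read_text(encoding='utf-8')
    except FileNotFoundError:
        return True
    return content_digest(current) != content_digest(rendered)
```

In the charm this check would replace the `old_hash` comparison, with `path` pointing at the systemd unit file; no stored state is needed in that case.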

    def on_start(self, event):
        unit = self.model.unit
        # If both replication factors are set to 1 and the current unit != initial cluster unit,
        # don't start the process if the cluster has already been initialized.
        # This configuration is not practical in real deployments (i.e. multiple units, RF=1).
        initial_unit = self.cluster.initial_unit
        if self.is_single_node and (
                initial_unit is not None and self.model.unit.name != initial_unit):
            unit.status = BlockedStatus('Extra unit in a single-node deployment.')
            return
        subprocess.check_call(['systemctl', 'start', self.COCKROACHDB_SERVICE])
        self.state.is_started = True
        self.on.cockroachdb_started.emit()

        if self.cluster.is_joined and self.cluster.is_cluster_initialized:
            unit.status = ActiveStatus()

    def on_cluster_relation_changed(self, event):
        self._setup_systemd_service()

        if self.state.is_started and self.cluster.is_cluster_initialized:
            self.model.unit.status = ActiveStatus()

    def on_cockroachdb_started(self, event):
        if not self.cluster.is_joined and not self.is_single_node:
            self.unit.status = WaitingStatus('Waiting for peer units to join.')
            event.defer()
            return

        if self.cluster.is_cluster_initialized:
            # Skip this event when some other unit has already initialized a cluster.
            self.unit.status = ActiveStatus()
            return
        elif not self.unit.is_leader():
            self.unit.status = WaitingStatus(
                'Waiting for the leader unit to initialize a cluster.')
            event.defer()
            return

        self.unit.status = MaintenanceStatus('Initializing the cluster')
        # Initialize the cluster if we're a leader in a multi-node deployment, otherwise it has
        # already been initialized by running start-single-node.
        if not self.is_single_node and self.model.unit.is_leader():
            subprocess.check_call([self.COCKROACH_BINARY_PATH, 'init', '--insecure'])

        self.on.cluster_initialized.emit(self.__get_cluster_id())
        self.unit.status = ActiveStatus()

    def __get_cluster_id(self):
        for _ in range(self.MAX_RETRIES):
            res = subprocess.run([self.COCKROACH_BINARY_PATH, 'debug', 'gossip-values',
                                  '--insecure'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            if not res.returncode:
                out = res.stdout.decode('utf-8')
                break
            elif not re.findall(r'code = Unavailable desc = node waiting for init',
                                res.stderr.decode('utf-8')):
                raise RuntimeError(
                    'unexpected error returned while trying to obtain gossip-values')
            sleep(self.RETRY_TIMEOUT.total_seconds())
        else:
            # Every attempt reported "waiting for init", so `out` was never assigned.
            raise RuntimeError('timed out waiting for gossip-values to become available')

        cluster_id_regex = re.compile(r'"cluster-id": (?P<uuid>[0-9a-fA-F]{8}\-[0-9a-fA-F]'
                                      r'{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12})$')
        for line in out.split('\n'):
            m = cluster_id_regex.match(line)
            if m:
                return m.group('uuid')
        raise RuntimeError('could not find cluster-id in the gossip-values output')
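
The parsing half of `__get_cluster_id` can be exercised without a running node. The sketch below reuses the regular expression from the diff in a standalone function; `parse_cluster_id` is a hypothetical name, and `SAMPLE` is made-up text shaped only like the one line the regex expects, not captured `gossip-values` output.

```python
import re
from typing import Optional

CLUSTER_ID_RE = re.compile(
    r'"cluster-id": (?P<uuid>[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}'
    r'-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$')


def parse_cluster_id(output: str) -> Optional[str]:
    """Return the first cluster ID found, or None if no line matches."""
    for line in output.splitlines():
        # match() anchors at the start of the line, so indented lines do not match.
        m = CLUSTER_ID_RE.match(line)
        if m:
            return m.group('uuid')
    return None


# Hypothetical sample input; real output contains many more keys.
SAMPLE = 'some-other-key: value\n"cluster-id": 0a1b2c3d-0000-4111-8222-333344445555'
print(parse_cluster_id(SAMPLE))
```

Returning `None` instead of raising keeps the helper easy to test; the caller can still raise `RuntimeError` as the method in the diff does.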

    def on_config_changed(self, event):
        # TODO: handle configuration changes to replication factors and apply them
        # via cockroach sql.
        pass
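
One possible shape for the TODO in `on_config_changed` is sketched below, covering only the `default-zone-replicas` option. The helper names are hypothetical, and the sketch assumes the `ALTER RANGE default CONFIGURE ZONE USING num_replicas = N` statement form and the `-e` flag of `cockroach sql`; both should be checked against the documentation for the CockroachDB version in use. A value of 0 yields no statement, following the config.yaml description that 0 means cluster defaults.

```python
from typing import List, Optional


def default_zone_statement(replicas: int) -> Optional[str]:
    """Return a zone-config statement, or None when 0 means 'use cluster defaults'."""
    if replicas < 0:
        raise ValueError('replica count must not be negative')
    if replicas == 0:
        return None
    return f'ALTER RANGE default CONFIGURE ZONE USING num_replicas = {replicas};'


def sql_command(binary: str, statement: str) -> List[str]:
    # --insecure mirrors how the charm currently starts the daemon.
    return [binary, 'sql', '--insecure', '-e', statement]


statement = default_zone_statement(3)
if statement is not None:
    print(sql_command('/usr/local/bin/cockroach', statement))
```

The resulting list could be passed to `subprocess.check_call` from the handler; the system data zones would need their own statements, which are left out here.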

    def on_proxy_listen_tcp_relation_joined(self, event):
        if not self.cluster.is_cluster_initialized or not self.state.is_started:
            event.defer()
            # Stop here; the handler runs again when the deferred event is re-emitted.
            return

        # TODO: make load-balancer options tunable.
        listen_options = [
            f'bind :{self.PSQL_PORT}',
            'balance roundrobin',
            'timeout connect 10s',
            'timeout client 1m',
            'timeout server 1m',
            'option clitcpka',
            'option httpchk GET /health?ready=1',
            'option tcplog',
        ]
        fqdn = socket.getnameinfo((str(self.cluster.advertise_addr), 0), socket.NI_NAMEREQD)[0]
        # Both fragments must be f-strings, otherwise HTTP_PORT is not interpolated.
        server_option = (f'server {fqdn} {self.cluster.advertise_addr}:{self.PSQL_PORT}'
                         f' check port {self.HTTP_PORT}')
        self.tcp_load_balancer.expose_server(self.PSQL_PORT, listen_options, server_option)
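
The option strings built in this handler are easy to get subtly wrong, since a missing `f` prefix leaves a literal `{...}` in the HAProxy configuration. A small sketch that factors them into pure functions makes them testable; the function names and the example host name and address are hypothetical, while the option values are the ones from the diff.

```python
from typing import List


def listen_options(psql_port: int) -> List[str]:
    """Options for the HAProxy listen section fronting the SQL port."""
    return [
        f'bind :{psql_port}',
        'balance roundrobin',
        'timeout connect 10s',
        'timeout client 1m',
        'timeout server 1m',
        'option clitcpka',
        # Health is checked over HTTP even though traffic is proxied as plain TCP.
        'option httpchk GET /health?ready=1',
        'option tcplog',
    ]


def server_option(fqdn: str, address: str, psql_port: int, http_port: int) -> str:
    """Server line: traffic goes to psql_port, health checks go to http_port."""
    return f'server {fqdn} {address}:{psql_port} check port {http_port}'


print(server_option('cockroachdb-0.example', '10.0.0.5', 26257, 8080))
```

With this split, the handler would only gather the address, resolve the host name, and pass the results to `expose_server`.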


if __name__ == '__main__':
    main(CockroachDBCharm)