This document describes the set of Custom Resource Definitions (CRDs) that will be used to configure a GCS Gluster cluster. The actual implementation of the resources described here will be phased in during development. The purpose of this document is to provide the overall structure, ensuring the end result provides necessary configurability in a user-friendly manner.
## Overview
A single Gluster operator may control one or more individual Gluster clusters. Each cluster may either be hosted within the Kubernetes cluster as a set of pods (a converged configuration) or run on hosts outside the Kubernetes cluster (referred to as independent mode). The capabilities of the operator differ significantly between these two modes of deployment, but the same set of CRDs should be used for both where possible.
A given Gluster cluster is defined by several different Custom Resources (CRs) that form a hierarchy. At the top level is the "cluster" CR (`GlusterCluster`) that describes cluster-wide configuration options such as the "name" for the cluster, the TLS credentials that will be used for securing communication, and peer clusters for geo-replication.
Incorporated into the cluster definition are a number of node definition templates. These describe the different configurations of nodes that the operator can create and how those nodes are spread across failure domains. Only nodes that use PersistentVolumes for their storage can be created via template. Other node types must be created manually. This includes both converged nodes that use local devices (directly accessing the `/dev/...` tree) and independent nodes that reside on external servers.
Below the cluster definition are node definitions (`GlusterNode`) that track the state of the individual Gluster pods. Manipulating these objects permits an administrator to place a node into a "disabled" state for maintenance or to decommission it entirely (by deleting the node object).
## Custom resources
This section describes the fields in each of the custom resources.
### Cluster CR
The cluster CR defines the cluster-level configuration. A commented example is shown below:
```yaml
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterCluster
metadata:
  # Name for the Gluster cluster that will be created by the operator
  name: my-cluster
  # CR is namespaced
  namespace: gcs
spec:
  # clusterOptions allows setting "gluster vol set" options that are
  # cluster-wide (i.e., ones that don't take a volname argument).
  clusterOptions:  # (optional)
    "cluster.halo-enabled": "yes"
  # drivers lists the CSI drivers that should be deployed for use with this
  # cluster.
  drivers:
    - gluster-fuse
    - gluster-block
  # Gluster CA to use for generating Gluster TLS keys.
  # Contains a Secret w/ the CA key & cert.
  glusterCA:  # (optional)
    secretName: my-secret
    secretNamespace: my-ns  # default is metadata.namespace
  # Geo-replication
  replication:  # (optional)
    # Credentials for using this cluster as a target
    credentials:
      secretName: my-secret
      secretNamespace: my-ns  # default is metadata.namespace
    targets:
      # Each target has a name that can be used in the StorageClass
      - name: foo
        # Addresses of node(s) in the peer cluster
        address:
          - 1.1.1.1
          - my.dns.com
        # Credentials for setting up the session (ssh user & key)
        credentials:
          secretName: my-secret
          secretNamespace: my-ns  # default is metadata.namespace
  # Only PV-based nodes are built from templates
  nodeTemplates:  # (optional)
    - name: myTemplate
      # Zone is the "failure domain"
      zone: my-zone  # default is .nodeTemplates.name
      thresholds:
        nodes: 7  # may only be specified if the other fields are absent
        minNodes: 3
        maxNodes: 42
        freeStorageMin: 1Ti
        freeStorageMax: 3Ti
      nodeAffinity:  # (optional)
        # https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
        # Would include "zone"-level affinity.
        # The operator will overlay best-effort pod anti-affinity.
        ...
      storage:
        storageClassName: my-sc
        capacity: 1Ti
status:
  # TBD operator state
  ...
```
All CRs live within the `operator.gluster.org` group and have version `v1alpha1`. The cluster `my-cluster`, above, would be contained within the `gcs` namespace. All components of the Gluster cluster would be expected to exist within this single namespace. The `spec` field provides the main configuration options.
The `clusterOptions` section contains Gluster options (i.e., options normally manipulated via the `gluster vol set` CLI) that do not take a volume parameter.
The `drivers` list names the CSI drivers that will be deployed by the operator for use with this Gluster cluster.
The `glusterCA` field holds a reference to a Kubernetes Secret containing the certificate authority `.key` and `.pem` files from which both client and server TLS keys can be generated. These will be used to automatically configure data encryption between the CSI driver and the Gluster bricks.
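The exact layout of this Secret is not mandated here; as a minimal sketch, assuming the data keys are named after the `.key` and `.pem` files, it might look like:

```yaml
# Hypothetical CA Secret referenced by spec.glusterCA. The data key names
# (ca.key / ca.pem) are illustrative assumptions; only a CA key and
# certificate are implied by this document.
apiVersion: v1
kind: Secret
metadata:
  name: my-secret
  namespace: my-ns
type: Opaque
data:
  ca.key: "<base64-encoded CA private key>"
  ca.pem: "<base64-encoded CA certificate>"
```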
The `replication` set of parameters defines the geo-replication configuration for this cluster, optionally as both a source and a target. If this cluster is to be used as a target for replication, the `replication.credentials` field must be supplied. This is a reference to a Secret that contains the inbound ssh user & key information. If this cluster is used as a source, the replication targets are presented as a list in `replication.targets`, providing a `name` for each remote cluster, the address(es) via `address`, and the ssh credentials via the `credentials` field.
The `nodeTemplates` list provides a set of templates that the operator can use to automatically scale the Gluster cluster as required and to automatically replace failed storage nodes.
Within this template, there is a `zone` tag that allows the nodes created from this template to be assigned to a specific failure domain. The default is to have the zone name equal to the template name. These zones can then be used to direct storage placement by referencing them in the StorageClass. Unless otherwise specified, volumes will be created with bricks from different zones.
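How a StorageClass refers to these zones is defined by the CSI driver rather than by this document; purely as a sketch, assuming a hypothetical `zones` parameter and provisioner string, it might look like:

```yaml
# Hypothetical StorageClass that restricts brick placement to one zone.
# Both the provisioner string and the "zones" parameter key are
# assumptions for illustration; the real keys come from the CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gluster-my-zone
provisioner: org.gluster.glusterfs
parameters:
  zones: my-zone
```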
The `thresholds` block places limits on how far the operator can scale each template up or down. Additionally, it provides thresholds that determine when scaling should be invoked. The template can have a fixed (constant) number of nodes by setting `nodes` to the desired value. The operator can also dynamically size the template if, instead of setting `nodes`, the `minNodes`, `maxNodes`, `freeStorageMin`, and `freeStorageMax` fields are configured. In this case, the number of storage nodes always remains between the min and max, and scaling within that range is triggered based on the amount of free storage (storage not yet assigned to a brick) that exists across the nodes in that template.
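To make the two sizing modes concrete, the following sketch shows a fixed-size `thresholds` block alongside a dynamically sized one (the values are arbitrary examples):

```yaml
# Fixed-size template: the operator maintains exactly 5 nodes.
thresholds:
  nodes: 5
```

```yaml
# Dynamically sized template: the operator keeps between 3 and 10 nodes,
# growing when free (unallocated) storage falls below 500Gi and shrinking
# when it exceeds 2Ti.
thresholds:
  minNodes: 3
  maxNodes: 10
  freeStorageMin: 500Gi
  freeStorageMax: 2Ti
```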
Each template is likely to have a `nodeAffinity` entry to guide the placement of the Gluster pods to a single failure domain within the cluster.
The `storage` block defines how the backing storage for the templated nodes is created. It includes the name of a StorageClass that can be used to allocate block-mode PVs and the capacity that should be requested from that class.
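As a point of reference, the claim that the operator would be expected to create from such a block is an ordinary block-mode PVC; a minimal sketch (the claim name is hypothetical) follows:

```yaml
# Hypothetical PVC created by the operator for a templated node's brick
# storage, using the StorageClass and capacity from the template above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-cluster-myTemplate-0   # name chosen for illustration only
  namespace: gcs
spec:
  storageClassName: my-sc
  volumeMode: Block        # raw block device to back Gluster bricks
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
```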
### Node CR
The Node CR defines a single Gluster server that is a part of the cluster. These node objects can either be created automatically by the operator from a template or they can be manually created.
```yaml
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  # Name for this node
  name: az1-001
  # CR is namespaced
  namespace: gcs
  annotations:
    # Applied by the operator when it creates/manages this object from a
    # template. When this is present, contents will be dynamically adjusted
    # according to the template in the cluster CR.
    # When this annotation is present, the admin may only modify
    # .spec.desiredState or delete the CR. Any other change will be
    # overwritten.
    anthill.gluster.org/template: template-name
spec:
  # Nodes belong to a cluster
  cluster: my-cluster
  # Nodes belong to a zone
  zone: az1
  # Admin (or operator) sets the desired state for the node.
  desiredState: enabled  # (enabled | disabled)
  # Only 1 of external | storage
  external:
    address: my.host.com
    credentials:
      secretName: my-secret
      secretNamespace: my-ns  # default is metadata.namespace
  storage:
    # Only 1 of device | pvcName
    # Device names must be stable on the host
    - device: /dev/sd[b-d]
      pvcName: my-pvc
      tags: [tag1, tag2]
  nodeAffinity:
    # https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
    # For admin-created GlusterNodes, this needs to specify a node selector
    # that matches exactly one node. For template-based GNs, this will inherit
    # from the template.
    ...
status:
  # TBD operator state
  # Possible states: (enabled | deleting | disabled)
  currentState: enabled
```
There will be one Node object for each Gluster node in the cluster. By manipulating this object, the administrator can perform a number of maintenance actions:
- By deleting a given Node object, the administrator signals the operator that the corresponding Gluster node should be de-commissioned and removed from the cluster. In the case of a converged deployment, the resources of the corresponding pod would also be freed.
- By changing the `.spec.desiredState` of the Node, the administrator can notify the operator (and by extension, other Gluster management layers) that the particular node should be considered "in maintenance" (`disabled`), such that it could be down for an extended time and should not be used as the source or target of migration, nor should it be used for new data allocation (see the sketch after this list).
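For example, placing a node into maintenance only requires changing that one field; a minimal sketch of the resulting object (edited or patched in place, e.g. with kubectl) follows:

```yaml
# GlusterNode being placed into maintenance: only .spec.desiredState is
# changed; everything else is left untouched.
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: az1-001
  namespace: gcs
spec:
  cluster: my-cluster
  zone: az1
  desiredState: disabled   # previously: enabled
```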
The annotation `anthill.gluster.org/template`, when present, indicates that this node object was created by the operator from the named template. As such, the operator will keep the fields of this object in sync with the template definition. When this annotation is present, the administrator should not modify any field other than `.spec.desiredState`. However, the administrator may still delete the object to signal that it should be decommissioned. Manually created `GlusterNode` objects should not have this annotation, and the administrator is free to modify all `.spec.*` fields.
Within the `.spec`, the `cluster` field contains the name of the Gluster cluster to which this node belongs, and the `zone` field denotes the failure domain zone name for this node. The `desiredState` field denotes whether this node should be considered disabled (i.e., not used for new allocations and possibly unavailable as well).
Only one of `external` or `storage` may be present. If `external` exists, this object represents a Gluster server that is running external to the Kubernetes cluster. The fields in this section provide the connection information (`address`) needed to access the node. For cases where Heketi and glusterd are in use, the `credentials` field can be used to provide authentication information so that Heketi can ssh to the node to perform management operations. This field is not necessary when running glusterd2.
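A manually created GlusterNode for such an independent-mode server might therefore look like the following sketch (the host name, zone, and secret name are placeholders):

```yaml
# Manually created GlusterNode for a server running outside Kubernetes
# (independent mode). No template annotation is present, so the admin
# owns all of .spec.
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: ext-001
  namespace: gcs
spec:
  cluster: my-cluster
  zone: dc1-rack3
  desiredState: enabled
  external:
    address: gluster01.example.com
    # Only needed when Heketi/glusterd manage the node via ssh;
    # unnecessary with glusterd2.
    credentials:
      secretName: ext-ssh-secret
```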
Converged nodes (running as pods) will have a `storage` section. This section provides a list of either devices or PersistentVolumeClaims that will be used by the node for creating bricks. When providing device names directly, care must be taken that the provided names refer to the same devices at all times.
The `nodeAffinity` section provides the ability to limit the Kubernetes nodes to which this Gluster server can be assigned. When devices are specified directly, this should be used to limit the Gluster node to a single Kubernetes node. When using a PVC, the node affinity should be such that the PVC is accessible from the nodes that match the affinity, and the affinity should be further restricted to comply with the desired failure domain zone tag.
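Putting the last two points together, a manually created converged node that consumes local devices and is pinned to exactly one Kubernetes node might look like the following sketch (node, device, and host names are placeholders):

```yaml
# Manually created converged GlusterNode using local devices. The
# nodeAffinity selects a single Kubernetes node so the device names always
# refer to the same disks.
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: az1-local-001
  namespace: gcs
spec:
  cluster: my-cluster
  zone: az1
  desiredState: enabled
  storage:
    - device: /dev/sdb
    - device: /dev/sdc
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-az1-07
```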
## Examples
Below are some example Gluster configurations using the custom resources defined above.
### AWS cluster, single AZ
This provides a very simple, single availability zone deployment with most options remaining as default. The Gluster pods can be placed arbitrarily within the cluster, and the number of nodes can be scaled as required to meet capacity demands.
```yaml
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterCluster
metadata:
  name: my-cluster
  namespace: gluster
spec:
  drivers:
    - gluster-fuse
  glusterCA:
    secretName: ca-secret
  nodeTemplates:
    - name: default
      thresholds:
        minNodes: 3
        maxNodes: 99
        maxStorage: 100Ti
        freeStorageMin: 500Gi
        freeStorageMax: 2Ti
      storage:
        storageClassName: ebs
        capacity: 1Ti
```
### AWS cluster, multi AZ
Building upon the previous single-AZ deployment is the following configuration that uses three different AZs for Gluster pods. Here, each zone's template provides a unique `storageClassName` to ensure the pod's backing storage is allocated from the correct EBS AZ, and it provides `nodeAffinity` such that the Gluster pod will be placed on a node that is compatible with the chosen EBS AZ.

The zone names used here (`az1a`, `az1b`, and `az1c`) can be referenced in the CSI driver's "data zones" list to control placement.
```yaml
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterCluster
metadata:
  name: my-cluster
  namespace: gluster
spec:
  drivers:
    - gluster-fuse
  glusterCA:
    secretName: ca-secret
  nodeTemplates:
    - name: az1a
      thresholds:
        minNodes: 3
        maxNodes: 99
        maxStorage: 100Ti
        freeStorageMin: 500Gi
        freeStorageMax: 2Ti
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                    - us-east-1a
      storage:
        storageClassName: ebs-1a
        capacity: 1Ti
    - name: az1b
      thresholds:
        minNodes: 3
        maxNodes: 99
        maxStorage: 100Ti
        freeStorageMin: 500Gi
        freeStorageMax: 2Ti
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                    - us-east-1b
      storage:
        storageClassName: ebs-1b
        capacity: 1Ti
    - name: az1c
      thresholds:
        minNodes: 3
        maxNodes: 99
        maxStorage: 100Ti
        freeStorageMin: 500Gi
        freeStorageMax: 2Ti
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                    - us-east-1c
      storage:
        storageClassName: ebs-1c
        capacity: 1Ti
```
### Bare-metal or virtualized on-prem
With an on-prem installation, it is likely that raw storage will be exposed to nodes as direct-attached storage, either as physical disks (bare metal) or as statically mapped devices or LUNs (e.g., VMware). In these cases, local block-mode PVs would be used for the storage backing the Gluster pods, leading to template definitions very similar to those above:
```yaml
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterCluster
metadata:
  name: my-cluster
  namespace: gluster
spec:
  drivers:
    - gluster-fuse
  glusterCA:
    secretName: ca-secret
  nodeTemplates:
    - name: default
      thresholds:
        minNodes: 3
        maxNodes: 99
        maxStorage: 100Ti
        freeStorageMin: 500Gi
        freeStorageMax: 2Ti
      storage:
        storageClassName: local-pv
        capacity: 1Ti
```
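The `local-pv` StorageClass referenced above would typically be backed by statically provisioned local PersistentVolumes rather than a dynamic provisioner; a minimal sketch of that supporting configuration (names, capacity, and paths are placeholders) could be:

```yaml
# StorageClass for pre-provisioned local volumes (no dynamic provisioner).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-pv
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# One statically created block-mode PV per local disk, pinned to the host
# that owns the disk.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: worker-01-sdb
spec:
  storageClassName: local-pv
  capacity:
    storage: 1Ti
  volumeMode: Block
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  local:
    path: /dev/sdb
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-01
```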