# RSSD Cloud Disk Mounting Problems 

In general, a cloud disk can only be mounted on a host in the same availability zone. An RSSD cloud disk must additionally be in the same `RDMA cluster` as the host. An `RDMA cluster` is a unit smaller than an availability zone; it is hidden in the console interface and can only be specified and queried through the API.

Overall, an RSSD cloud disk has the following requirements for the host:

- The host must be O-type, i.e., a Kuaijie-type host.
- The host and cloud disk must be in the same availability zone.
- The host and cloud disk must be in the same `RDMA cluster`.

If the first two requirements are met but the third is not, mounting fails. The typical error message looks like this (note the error code `17218`):

```txt
[17218] Command failed: add_udisk.py failed
```
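The three requirements can be read as a simple predicate. The Python sketch below is only an illustration: the `Host` and `Disk` records and their field names are hypothetical and do not correspond to any real API. It just shows how a host can pass the first two checks and still fail on the RDMA cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    machine_type: str  # "O" stands for the Kuaijie type (hypothetical field)
    zone: str
    rdma_cluster: str


@dataclass
class Disk:
    zone: str
    rdma_cluster: str


def attach_blockers(host: Host, disk: Disk) -> List[str]:
    """Return the unmet requirements; an empty list means the mount can proceed."""
    blockers = []
    if host.machine_type != "O":
        blockers.append("host is not O-type")
    if host.zone != disk.zone:
        blockers.append("availability zone mismatch")
    if host.rdma_cluster != disk.rdma_cluster:
        blockers.append("RDMA cluster mismatch")
    return blockers


# Same zone, O-type host, but different RDMA clusters (made-up values).
host = Host("O", "zone-a", "9002_25GE_D_R006")
disk = Disk("zone-a", "9002_25GE_D_R005")
print(attach_blockers(host, disk))  # ['RDMA cluster mismatch']
```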

Another issue is that the `RDMA cluster` may change at any time: the cloud disk or the host may be migrated to a different `RDMA cluster`, and downstream services are not notified when this happens.

When CSI deals with Pods using RSSD cloud disks, it needs to solve the following two problems:

- When creating a new RSSD cloud disk, make sure its `RDMA` is consistent with the node where the Pod is located.
- When re-scheduling a Pod using an RSSD cloud disk, make sure that the scheduled node is O-type, and its `RDMA` is consistent with the cloud disk.

**Generally, users don't need to worry about these issues. However, for historical design reasons, CSI versions prior to `22.09.1` run into serious mount failures if an `RDMA migration` occurs while they manage RSSD cloud disks. The rest of this document explains CSI's scheduling mechanism for RSSD cloud disks in detail, so that you can understand the issue and why it is essential to upgrade CSI to version `22.09.1` or above when using RSSD cloud disks.**

## Static Scheduling

> This is the plan adopted by CSI in versions below `22.09.1`.

In CSI versions before `22.09.1`, scheduling is achieved by adding `nodeAffinity` to the PV. For more information about node affinity, please see the official documentation: [Assign Pods to Nodes using Node Affinity](https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/).

For Kuaijie-type cloud hosts, the RDMA cluster is stored in a topology label on the node, for example:

```yaml
topology.udisk.csi.ucloud.cn/rdma-cluster-id: 9002_25GE_D_R006
```

This indicates that this node is in the `RDMA cluster` `006`.

For an RSSD cloud disk PV, the RDMA cluster is saved in its `nodeAffinity`:

```yaml
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.udisk.csi.ucloud.cn/rdma-cluster-id
              operator: In
              values:
                - 9002_25GE_D_R006
```

These two fields are written by CSI and cannot be changed once written. When scheduling, the node affinity mechanism ensures that Pods using RSSD cloud disks are only scheduled to nodes whose RDMA cluster matches.


As long as the RDMA cluster does not change, this works. But once an `RDMA migration` occurs, UK8S cannot detect it and cannot update the information above, so the stored data becomes inconsistent with reality.

Even if UK8S could detect the migration, `nodeAffinity` is `immutable`, so the stored value cannot be corrected by updating the field.

Such an inconsistency causes serious problems. Suppose the actual RDMA cluster of a cloud disk is `005`, but the value saved in UK8S is `006`. Node affinity then schedules the Pod using the disk to a node in RDMA cluster `006`, which does not match the disk's actual RDMA cluster, and mounting the cloud disk ultimately fails.
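The `005`/`006` scenario can be replayed with a few lines of Python. This is a toy model, not CSI code: the node names, the dictionary layout, and both helper functions are made up for illustration.

```python
# Hypothetical cluster state: node name -> RDMA cluster label on the node.
nodes = {"node-a": "9002_25GE_D_R005", "node-b": "9002_25GE_D_R006"}

pv_affinity_rdma = "9002_25GE_D_R006"  # written once into the PV, never updated
disk_actual_rdma = "9002_25GE_D_R005"  # where the disk really is after migration


def nodes_allowed_by_affinity(nodes, required_rdma):
    """Static scheduling: only nodes whose label equals the stored value pass."""
    return sorted(name for name, rdma in nodes.items() if rdma == required_rdma)


def mount_succeeds(node_rdma, disk_rdma):
    """Mounting needs the node and the disk to be in the same RDMA cluster."""
    return node_rdma == disk_rdma


candidates = nodes_allowed_by_affinity(nodes, pv_affinity_rdma)
for name in candidates:
    # The only schedulable node is exactly the one where the mount fails.
    print(name, mount_succeeds(nodes[name], disk_actual_rdma))  # node-b False
```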

If your CSI version is less than `22.09.1`, RSSD cloud disk mounting issues occur, and you find error code `17218` in the CSI log, then the problem is most likely caused by RDMA migration.

**It can be seen that we should not store RDMA information in the UK8S cluster in any way as this information is completely unreliable. We should obtain RDMA dynamically for scheduling.**

## Dynamic Scheduling

> This is the plan adopted in CSI version `22.09.1` and above (this version requires the Kubernetes version to be no less than 1.18).

In CSI version `22.09.1` and above, node affinity is no longer used to schedule RSSD cloud disks; all RDMA information is obtained dynamically at scheduling time. Setting aside existing (stock) data, dynamic scheduling solves the RDMA migration issue.

To implement dynamic scheduling, dynamic logic needs to be inserted in at least two places:

- When creating an RSSD cloud disk, dynamically obtain the RDMA of the node.
- When scheduling RSSD cloud disks, dynamically obtain the RDMA of the cloud disk and nodes for matching.

The following two sections will introduce how CSI solves the above problems.

### Creating RSSD Cloud Disk

Creating an RSSD cloud disk is implemented in CSI, so we only need to change the logic of creating a disk in CSI:

- Remove the topology label: `topology.udisk.csi.ucloud.cn/rdma-cluster-id`.
- When creating a new PV, stop writing the `nodeAffinity` entry for `topology.udisk.csi.ucloud.cn/rdma-cluster-id`.
- When creating RSSD cloud disks, call the API to get the RDMA information of the node and pass it to the RSSD cloud disk creation interface.

In this way, newly created RSSD cloud disks will no longer rely on node affinity for scheduling.
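The creation flow described above might look roughly like the following Python sketch. Everything here is an assumption made for illustration: `get_node_rdma` and `create_disk` stand in for the real API calls (whose names and parameters are not given in this document), and the returned dictionary is a simplified stand-in for a PV.

```python
def create_rssd_volume(node_name, size_gb, get_node_rdma, create_disk):
    """Create an RSSD disk in the node's current RDMA cluster.

    get_node_rdma and create_disk are injected callables standing in for
    the real cloud API (hypothetical names and signatures).
    """
    # Look up the RDMA cluster at creation time instead of reading a node label.
    rdma = get_node_rdma(node_name)
    disk_id = create_disk(size_gb=size_gb, rdma_cluster=rdma)
    # The resulting PV description carries no RDMA nodeAffinity.
    return {"volumeHandle": disk_id, "capacityGb": size_gb}


# Stub implementations so the sketch can be run locally.
calls = []


def fake_get_node_rdma(node_name):
    return {"node-a": "9002_25GE_D_R005"}[node_name]


def fake_create_disk(size_gb, rdma_cluster):
    calls.append((size_gb, rdma_cluster))
    return "disk-0001"


pv = create_rssd_volume("node-a", 20, fake_get_node_rdma, fake_create_disk)
print(pv, calls)
```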

### Scheduling RSSD Cloud Disk

When scheduling a Pod that uses an RSSD cloud disk, the scheduler needs to obtain the RDMA information of that disk dynamically, which requires calling the DezaiCloud API. The native kube-scheduler cannot do this, so we implement it with the extension mechanism that kube-scheduler provides.

kube-scheduler provides two mechanisms for extending scheduling, the Extender mechanism and the Framework mechanism. Their differences, briefly:

- `scheduler extender`: The scheduling plugin is deployed on the master node. During scheduling, the scheduler calls the extension at the relevant extension points via HTTP.
- `scheduler framework`: You compile a complete standalone scheduler with your own scheduling logic inserted into it, and deploy it in the cluster as a separate Deployment. To use this scheduler, you need to modify the `schedulerName` setting.

Generally speaking, the official recommendation is to use the second method to extend the scheduler. However, we do not want to modify `schedulerName`, and the second method would require us to maintain a separate scheduler build for each Kubernetes version, which is costly to maintain. We therefore use the first method.

This requires deploying a separate HTTP service on the master node of the cluster to implement the custom scheduling logic, and adding an `extenders` section to the scheduler configuration:

```yaml
extenders:
  - urlPrefix: http://127.0.0.1:6678/
    filterVerb: filter
    httpTimeout: 60s
```

This configures a `filter` extension: during scheduling, the scheduler calls the `http://127.0.0.1:6678/filter` interface.

> Special note: The `extenders` feature was introduced after the scheduler's `v1beta2` version, so a cluster whose Kubernetes version is lower than 1.19 cannot be extended this way. If you are using a lower version of Kubernetes, upgrade the Kubernetes version first.

This way, we can call the DezaiCloud API at the extension point to dynamically obtain the RDMA cluster information of the RSSD cloud disk and the nodes, and filter the nodes based on this information.


The scheduler-extender checks whether the Pod uses a PV backed by an RSSD cloud disk. If so, it calls the DezaiCloud API to get the RDMA information of the PV and the nodes, and filters out the nodes whose RDMA cluster does not match.
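The core of such a filter can be sketched in a few lines of Python. This is not the actual scheduler-extender code and does not reproduce the HTTP payload exchanged with kube-scheduler; `filter_nodes` and the two lookup callables are hypothetical stand-ins for the DezaiCloud API queries.

```python
from typing import Callable, Dict, List, Tuple


def filter_nodes(
    disk_ids: List[str],
    node_names: List[str],
    get_disk_rdma: Callable[[str], str],
    get_node_rdma: Callable[[str], str],
) -> Tuple[List[str], Dict[str, str]]:
    """Split candidate nodes into (passed, failed-with-reason)."""
    if not disk_ids:
        # The Pod uses no RSSD PV, so there is nothing to filter.
        return list(node_names), {}
    # RDMA clusters of all RSSD disks the Pod needs, queried at scheduling time.
    required = {get_disk_rdma(d) for d in disk_ids}
    passed, failed = [], {}
    for name in node_names:
        node_rdma = get_node_rdma(name)
        # A node passes only if every disk is in the node's RDMA cluster.
        if required == {node_rdma}:
            passed.append(name)
        else:
            failed[name] = f"RDMA cluster {node_rdma} does not match RSSD disk(s)"
    return passed, failed


# Made-up lookup tables standing in for live API responses.
disk_rdma = {"disk-1": "9002_25GE_D_R005"}
node_rdma = {"node-a": "9002_25GE_D_R005", "node-b": "9002_25GE_D_R006"}

passed, failed = filter_nodes(
    ["disk-1"], ["node-a", "node-b"], disk_rdma.__getitem__, node_rdma.__getitem__
)
print(passed, failed)
```

Because both lookups happen on every scheduling attempt, a migrated disk or node is matched against its current RDMA cluster rather than a stored value.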

To deploy it, a new `systemd` service called `scheduler-extender-uk8s` is added on the master node of your cluster. You can check the service health with the following command:

```shell
systemctl status scheduler-extender-uk8s
```

If there is a problem with the scheduling, you can check the log with the following command:

```shell
journalctl -u scheduler-extender-uk8s.service -f
```

When a cluster of version 1.19 or above is installed, UK8S automatically installs `scheduler-extender-uk8s` on all master nodes. For how to handle existing clusters, please refer to the content below.

## Upgrading CSI and Installing scheduler-extender via Console

**Note that the new version of CSI must be used together with the scheduler-extender. When upgrading CSI to `22.09.1`, please perform the upgrade in the console instead of modifying the image manually.**

The new CSI version will integrate the scheduler-extender version, formatted as `{csi-version}-se{scheduler-extender-version}`, for example, `22.09.1-se22.08.3` indicates that the CSI version is `22.09.1` and the scheduler-extender version is `22.08.3`. If the scheduler-extender is not installed in the cluster, the format is `{csi-version}-se-unknown`, such as `21.09.1-se-unknown`.
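As an illustration of this format, the following Python sketch splits an integrated version string into its two parts. The function name `parse_integrated_version` is hypothetical; only the `{csi-version}-se{scheduler-extender-version}` and `{csi-version}-se-unknown` formats come from the text above.

```python
from typing import Optional, Tuple


def parse_integrated_version(version: str) -> Tuple[str, Optional[str]]:
    """Return (csi_version, extender_version); extender is None if not installed."""
    csi, sep, extender = version.partition("-se")
    if not sep or not csi or not extender:
        raise ValueError(f"not an integrated version string: {version!r}")
    if extender == "-unknown":
        # "{csi-version}-se-unknown": scheduler-extender is not installed.
        return csi, None
    return csi, extender


print(parse_integrated_version("22.09.1-se22.08.3"))   # ('22.09.1', '22.08.3')
print(parse_integrated_version("21.09.1-se-unknown"))  # ('21.09.1', None)
```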

With this integrated version number, CSI and the scheduler-extender can easily be bound together, and there is no need to add a separate scheduler-extender plugin management page.

### Query Version

You can query the CSI version directly by checking the `image` in the `StatefulSet`. Querying the `scheduler-extender` version is more complex: because the scheduler-extender is deployed through systemd, it requires logging into the master node of the cluster and running a command there.

This involves asynchronous tasks. Jobs are deployed on the master node to query the scheduler-extender version. The original CSI version query was synchronous, so it has to be changed to an asynchronous call, similar to the CNI version query.

### Version Upgrade Inconsistency Issue

From now on, every CSI upgrade must first upgrade or install the scheduler-extender, because the new CSI relies on the scheduler-extender to function. If the upgrade or installation of the scheduler-extender fails, the entire CSI upgrade process must be halted.

The possible inconsistency here is that the scheduler-extender is installed successfully but the CSI upgrade fails. In that case two constraints on scheduling RSSD PVs coexist in the cluster: the scheduler-extender and `nodeAffinity`. Both ensure that an RSSD PV is scheduled to a node in the matching RDMA cluster; one is dynamic and the other is static. The two constraints do not conflict with each other.

In summary:

- A successful scheduler-extender installation followed by a failed CSI upgrade is tolerable. It is equivalent to having both constraints at the same time, and the customer can simply retry the CSI upgrade later.
- Upgrading CSI when the scheduler-extender is not installed is not tolerable, because it would leave RSSD PVs in the cluster without any constraint.

Therefore, we did not implement a rollback for failed upgrades; we only make sure that the scheduler-extender is upgraded first and CSI second.

In addition, when upgrading CSI, we also added the following constraints:

- If the cluster version is lower than 1.19.x, the CSI upgrade is not allowed; customers need to upgrade the cluster version first.
- The CSI upgrade is not allowed if the customer's cluster contains PVs that include RDMA `nodeAffinity`. Such data must first be fixed manually (see the following section on handling historical stock data).
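The two constraints above can be expressed as a pre-check. The Python sketch below is an assumption about how such a check could look, not the console's actual implementation; it takes PV objects as plain dictionaries (for example, parsed from `kubectl get pv -o json`) and looks for the RDMA topology key under `spec.nodeAffinity`.

```python
RDMA_KEY = "topology.udisk.csi.ucloud.cn/rdma-cluster-id"


def pv_has_rdma_affinity(pv: dict) -> bool:
    """True if the PV's required nodeAffinity mentions the RDMA topology key."""
    terms = (
        pv.get("spec", {})
        .get("nodeAffinity", {})
        .get("required", {})
        .get("nodeSelectorTerms", [])
    )
    return any(
        expr.get("key") == RDMA_KEY
        for term in terms
        for expr in term.get("matchExpressions", [])
    )


def csi_upgrade_blockers(k8s_version: str, pvs: list) -> list:
    """Return human-readable reasons why the CSI upgrade must not start."""
    blockers = []
    # Assumes a plain "major.minor[.patch]" version string, optionally prefixed with "v".
    major, minor = (int(p) for p in k8s_version.lstrip("v").split(".")[:2])
    if (major, minor) < (1, 19):
        blockers.append("cluster version is lower than 1.19")
    stale = [pv["metadata"]["name"] for pv in pvs if pv_has_rdma_affinity(pv)]
    if stale:
        blockers.append("PVs with RDMA nodeAffinity: " + ", ".join(stale))
    return blockers


# Made-up PV mirroring the nodeAffinity snippet shown earlier in this document.
old_pv = {
    "metadata": {"name": "pv-old"},
    "spec": {"nodeAffinity": {"required": {"nodeSelectorTerms": [
        {"matchExpressions": [
            {"key": RDMA_KEY, "operator": "In", "values": ["9002_25GE_D_R006"]}
        ]}
    ]}}},
}
new_pv = {"metadata": {"name": "pv-new"}, "spec": {}}

print(csi_upgrade_blockers("1.18.8", [old_pv, new_pv]))
print(csi_upgrade_blockers("1.20.6", [new_pv]))
```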

## Dealing with Historical Stock Data

The above solves the problems for newly created RSSD cloud disks, but does not handle stock data.

The obvious approach would be to remove the `nodeAffinity` data from the PVs after the upgrade completes. However, Kubernetes makes this difficult: `nodeAffinity` is `immutable` and cannot be modified directly. As long as `nodeAffinity` exists, kube-scheduler constrains the PV based on node affinity, and problems arise when a disk migration occurs.

We need special means to remove `nodeAffinity` from a PV. The approach is very hacky: it requires directly modifying the data in etcd. This is a highly unconventional operation that cannot be integrated into automated tools and requires manual intervention.

**If your cluster contains RSSD cloud disk data, please contact our technical support. We will manually fix the data for you.**