Clustering
This section provides an overview of the components and requirements for configuring a cluster with Fusion File Share Server. It does not include a step-by-step guide for setting up a cluster, because cluster setup varies significantly depending on the specific environment and organizational requirements.
For detailed examples of configuring clustered Fusion File Share Server installations, refer to the Active-Active Clustering and Active-Passive Clustering howto guides.
Introduction
Fusion File Share Server supports clustering features that provide Continuous Availability (CA), Scale-Out, and High Availability (HA). These features complement each other to provide robust, reliable, and scalable performance:
- Continuous Availability: Continuous availability ensures that, in the event of a cluster node failure, another node takes over, providing uninterrupted service to all clients. This feature is especially critical for SMB servers, where maintaining active connections without disruption is essential. If a node fails, continuous availability mechanisms enable the new node to handle both new client requests and existing SMB connections from the failed node.
- Scale-Out: Scale-out is a strategy characterized by adding nodes to a system to handle increased load, rather than enhancing the existing nodes' capacity. This distributes workloads across multiple servers, allowing the system to handle more requests and efficiently serve more clients.
  In the context of an SMB server, scale-out enables seamless expansion as demand increases, ensuring high performance and maintaining service reliability.
- High Availability: High availability combines the benefits of continuous availability and scale-out, allowing the system to handle increased loads by adding nodes while maintaining uninterrupted service during node failures. This approach is ideal for environments where both performance and fault tolerance are critical, ensuring scalability without compromising availability.
Based on fault tolerance and performance requirements, Fusion File Share Server can be configured in two modes, Active-Passive and Active-Active:
- Active-Passive: Active-passive is the simplest clustering configuration for Fusion File Share Server. However, among the three benefits of clustering, it provides only continuous availability. In this mode, only one node is active at a time, with standby node(s) ready to take over in the event of a failure. A failover process is required during this transition, causing existing SMB connections and file operations to hang briefly (typically for a few seconds) until failover completes.
  The simplicity of this configuration makes it ideal for environments where ease of setup and maintenance are a priority.
  Active-passive clustering is well-suited for business-critical applications where the cost of downtime is high, but a single SMB server is sufficient to handle the workload. It provides a cost-effective solution for ensuring continuous availability without complex setup.
- Active-Active: Active-active is a clustering configuration for Fusion File Share Server that provides all three benefits of clustering: continuous availability, scale-out, and high availability. In this mode, multiple servers operate simultaneously as a cluster, distributing connections and workloads across all nodes. This approach leverages the combined networking bandwidth of all servers. In the event of a node failure, only a subset of clients is affected, as their connections are automatically resumed on another server, virtually eliminating downtime.
  This configuration is ideal for environments requiring minimal downtime, with scalability to handle increased loads by adding additional servers. However, it is more complex to configure and manage compared to active-passive clustering.
Common use cases for active-active clustering include:
- Business-Critical Applications in Larger Environments: Active-active clustering is often used in large environments where downtime is costly and a single SMB server cannot handle the workload.
- Environments with High Bandwidth and Low Latency Demands: Active-active clustering is commonly deployed where bandwidth demands are high. Examples include:
  - Video Production: Enables multiple users to work on different projects simultaneously, requiring substantial bandwidth.
  - Low-Latency Media Streaming: Supports a large number of clients that require high bandwidth and cannot tolerate buffering delays.
Clustering features typically incur a performance overhead, since they involve caching and persisting the server's state and coordinating between nodes.
Fusion File Share Server as Part of Your CA/HA Solution
Fusion File Share Server plays a critical role in delivering a continuously available, highly available, and scalable file sharing solution. Its primary function is the high-performance handling of the SMB protocol. However, its effectiveness depends on the supporting infrastructure to coordinate behavior across cluster nodes. This infrastructure includes:
- Shared Storage: One or more file systems accessible by all nodes in the cluster. This shared storage holds the files to be served by Fusion File Share Server, as well as the server's configuration and persistent state.
- Properly Configured Networking: A network layout and configuration that facilitates communication between all nodes and clients, including a mechanism to distribute client connections across all nodes in the cluster.
- Clustering Software: Software that manages the communication between the nodes in the cluster, coordinates internal state, and ensures that the cluster operates as a single entity.
Fusion File Share Server does not provide or mandate specific infrastructure components or clustering software. Instead, it integrates with the solutions chosen by the customer. It is the customer's responsibility to determine and implement shared storage, networking, and load balancing solutions.
To ensure reliability, redundancy in critical services—such as storage and networking—is essential. The cluster's overall stability often hinges on its weakest points. While established best practices exist, Fusion File Share Server's modular design offers flexibility in selecting the components best suited to your specific environment.
Shared Storage and Persistent State
In both active-passive and active-active configurations, shared storage is accessible by all nodes in the cluster. This shared storage serves the following purposes:
- Data storage – the files to be shared via Fusion File Share Server.
- Fusion File Share Server configuration:
- The Fusion File Share Server configuration file.
- The user database file (if the file-backed user database is used for authentication).
- Persistent state and metadata, including information about ongoing operations and the state of the server and clients:
- The persistent file handle database.
- The connection recovery database.
- The privilege database.
For active-passive configurations, all shares' vfs parameters should be set to libc:force_sync.
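A minimal sketch of an active-passive share definition with this setting, using the configuration format shown in the example later in this section (the share name and path are placeholders):

[share]
netname = sh1
path = /mnt/shared/sh1
vfs = libc:force_sync
[/share]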
Mount Points
When setting up a cluster, ensure that shared storage mount points and mount options are configured identically across all nodes. Whether using a single device or multiple devices, they must be mounted on the same mount points on all nodes. This consistency is essential because the nodes share the same Fusion File Share Server configuration files, in which the mount points are defined.
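As a sketch of what this consistency looks like in practice, every node could carry an identical /etc/fstab entry for the shared storage; the device path, file system type, and mount options below are placeholders:

# /etc/fstab: identical entry on every node in the cluster
/dev/mapper/shared_vg-shared_lv  /mnt/shared  gfs2  noatime,_netdev  0 0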
File System
For active-active configurations, a file system that can be mounted read-write on all nodes simultaneously is required. Such file systems include (but are not limited to):
- GlusterFS
- WekaFS
- GFS2
- OCFS2
- Lustre
- CephFS
- GPFS
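For example, if GlusterFS (one of the options listed above) provides the shared file system, each node mounts the same volume read-write at the same mount point; the host and volume names here are placeholders:

# Run on every node in the cluster
mount -t glusterfs storage1:/gv0 /mnt/shared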
For active-passive configurations, the file system requirements are more relaxed. To minimize failover times, consider using a distributed or clustered file system. However, if failover time is not a concern, a non-clustered file system such as the following may be used:
- Microsoft NTFS by Tuxera
- ext4
- XFS
- ZFS
- Any other file system that suits your needs.
Storage Redundancy
While not required by Fusion File Share Server, using redundant storage controllers, network connections, and storage arrays is highly recommended. Configuring drives in RAID and utilizing multipathing can further enhance fault tolerance and system availability.
Networking and Load Balancing
Beyond providing redundant network connections through multiple NICs connected to different switches, there are additional considerations when setting up networking for your Fusion File Share Server cluster.
Active-Passive and Floating IPs
In an active-passive configuration, a floating IP address acts as a single point of access to the cluster. This IP address is assigned to the active node and, in the event of a failure, is reassigned to the standby node. Clients connect to the server using this IP address. The health of the nodes and the failover process are typically managed by clustering software such as Pacemaker.
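With Pacemaker, for example, the floating IP can be modeled as a cluster resource that follows the active node. A minimal sketch using the pcs tool (the resource name and address are placeholders):

# Floating IP that Pacemaker assigns to the currently active node
pcs resource create tsmb_vip ocf:heartbeat:IPaddr2 \
    ip=192.168.10.50 cidr_netmask=24 \
    op monitor interval=10s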
Active-Active and Load Balancing
For active-active configurations, each server has its own active IP address that clients can connect to. Typically, rudimentary load balancing can be performed by DNS round-robin, where multiple IP addresses are associated with the same domain name. This method allows clients to connect to different servers within the cluster.
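As a sketch, DNS round-robin amounts to publishing one A record per node under a single name. In a BIND-style zone file (host names and addresses are placeholders):

; Clients resolving smb.example.com receive these addresses in rotating order
smb    IN    A    192.168.10.11
smb    IN    A    192.168.10.12
smb    IN    A    192.168.10.13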
While DNS is the most commonly used method for distributing client connections across multiple SMB servers, alternative name resolution methods can be used, such as editing the hosts file on server and client machines. Regardless of the chosen method, it is critical to ensure that:
- Clients can successfully connect to all servers in the cluster.
- The servers within the cluster can communicate effectively with one another.
Network Separation
To optimize performance and reliability, it is recommended to separate network traffic between cluster nodes (primarily used by the clustering software) from traffic between clients and servers. This can be achieved through:
- Separate network interfaces
- Configured VLANs
- Dedicated physical networks
This separation ensures that internal cluster communication, which requires the lowest possible latency and collision rate, is unaffected by client traffic, which may saturate the network under heavy load.
Network Redundancy
Although not a strict requirement of Fusion File Share Server, implementing redundant network connections, switches, and routers is strongly recommended. These measures enhance fault tolerance and ensure high availability.
Clustering Software
Clustering software refers to solutions that manage communication between cluster nodes, coordinate internal state, and ensure that the cluster operates as a single entity. These tools perform critical tasks such as:
- Monitoring node health
- Detecting failures
- Initiating failover processes
While Fusion File Share Server does not include clustering software, it integrates with external tools to support these functions:
- Corosync Cluster Engine: Provides reliable messaging between nodes and establishes quorum to maintain cluster integrity.
- Pacemaker: An open-source, high-availability cluster resource manager that monitors node health and coordinates failover processes.
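For illustration only (Fusion File Share Server does not mandate this layout), a minimal two-node corosync.conf might place cluster traffic on a dedicated network, in line with the network separation guidance above. All names and addresses are placeholders:

# Dedicated cluster network (10.10.10.0/24) keeps Corosync traffic
# separate from client SMB traffic.
totem {
    version: 2
    cluster_name: tsmb-cluster
    transport: knet
}
nodelist {
    node {
        name: node1
        nodeid: 1
        ring0_addr: 10.10.10.1
    }
    node {
        name: node2
        nodeid: 2
        ring0_addr: 10.10.10.2
    }
}
quorum {
    provider: corosync_votequorum
    # two_node allows a two-node cluster to retain quorum when one node fails
    two_node: 1
}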
Configuring Fusion File Share Server for Clustering
To enable continuous availability, scale-out, and high-availability capabilities, Fusion File Share Server relies on several key clustering mechanisms:
- Shared Configuration: Ensures that all nodes in the cluster behave and present themselves as a single SMB server.
- Persistent File Handle Database: Allows clients to continue accessing open files on another node in the event of a failure.
- Connection Recovery Database: Facilitates immediate resumption of communication with clients after a node failure, avoiding the need for clients to time out and reconnect.
- Scale-Out: Allows additional servers to be added to the cluster to handle increased loads.
For Active-Passive configurations, you need to enable and configure the Persistent File Handle Database and the Connection Recovery Database.
For Active-Active configurations, you must enable and configure all these mechanisms.
Prerequisites
Before configuring Fusion File Share Server for clustering, ensure that the following prerequisites are met:
Shared Storage
Configure the file system(s) to be accessible by all nodes in the cluster:
- For active-active configurations, the file system must be mounted read-write on all nodes simultaneously.
- For active-passive configurations, the file system must be mounted on at least the active node (and optionally on the passive node).
The number of devices, volumes, and mount points is entirely at your discretion, based on your storage infrastructure's performance characteristics, file system capabilities, and security requirements. The following configurations are all valid as long as they meet these criteria:
- A single mount point for all shares, including configuration files and persistent state.
- Separate mount points for configuration files and the persistent state, with individual mount points for each share.
- Individual mount points for each share, configuration file, and persistent state.
- Any other configuration that meets your requirements.
Networking
Ensure the network is configured to enable communication between all nodes and clients:
- For active-passive configurations, configure a floating IP address.
- For active-active configurations, set up a load balancing mechanism, such as DNS round-robin.
- Optionally, set up a dedicated network for communication between the nodes. (Recommended)
Clustering Software
Install and configure clustering software (such as Corosync and Pacemaker) on all nodes in the cluster:
- For active-passive configurations, the clustering software should be configured to monitor nodes and initiate failover processes when necessary (see the Pacemaker sketch following this list). This involves:
- Moving the floating IP address to the standby node.
- Mounting shared storage on the standby node, if required.
- Starting the Fusion File Share Server service on the standby node using the configuration file on the shared storage.
- For active-active configurations, the clustering software should be configured to continuously monitor node health, ensuring only nodes that meet storage, network, and Fusion File Share Server service requirements remain in the cluster.
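The following sketch shows how the active-passive steps above might be expressed as Pacemaker resources with pcs. The device path, mount point, file system type, and the tsmb systemd unit name are assumptions; substitute the values used in your environment:

# Shared storage, mounted only on the node that currently owns the group
pcs resource create tsmb_fs ocf:heartbeat:Filesystem \
    device=/dev/mapper/shared_vg-shared_lv directory=/mnt/shared fstype=xfs

# Fusion File Share Server service (hypothetical systemd unit name "tsmb")
pcs resource create tsmb_service systemd:tsmb

# Group the file system, the floating IP from the earlier example, and the
# service so they run together on one node and start in this order on failover
pcs resource group add tsmb_ap tsmb_fs tsmb_vip tsmb_service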
Shared Configuration
Place the Fusion File Share Server configuration file on the shared storage and ensure that all nodes in the cluster can access it. The configuration file must be identical across all nodes and contain the configuration for all shares as well as the global configuration parameters.
When running Fusion File Share Server in a clustered configuration, use the tsmb-cfg utility for all updates to the configuration. This ensures changes are consistently applied across all nodes in the cluster.
Scale-Out
Fusion File Share Server supports scale-out functionality, allowing multiple servers to run the Fusion File Share Server software as part of a single installation. Configuring scale-out is simply a matter of setting the scale-out mode:
- Scale-Out Disabled: When scale-out is disabled, Fusion File Share Server behaves as a single active server. This is the appropriate setting for standalone or active-passive configurations.
- Scale-Out Enabled: When scale-out is enabled, Fusion File Share Server behaves as part of a server cluster, suitable for active-active configurations. In this mode, nodes in the cluster maintain a shared FSA state over Corosync to determine the state of all open files and the order of operations on them.
  Note: Upon startup in scale-out mode, Fusion File Share Server attempts to join the cluster by connecting to the Corosync cluster engine. If Corosync is not installed on the node, Fusion File Share Server will fail to start.
- Autonomous Mode: In autonomous mode, Fusion File Share Server allows multiple servers to run simultaneously in a cluster without sharing the FSA state. This is useful in very large clusters where maintaining a shared FSA state would incur significant overhead.
  In this mode, each server maintains its own FSA state, and the servers do not need to establish consensus on the order of operations such as opening files or acquiring locks on them. However, this can cause issues if multiple clients access the same file on different nodes and one or more clients attempt to write to it. Because the share mode is not synchronized between server nodes in autonomous mode, such access would be allowed but would likely result in data corruption.
  Autonomous mode is only suitable for specific workloads, such as:
  - Read-heavy workloads, where the majority of operations are read operations.
  - Sharded workloads, where each client has its own share or directory, and the servers do not have to coordinate on the same files.
  - Append-only workloads, where the servers do not have to coordinate on the same files.
  Use autonomous mode with caution! Enable autonomous mode only if your workload is well-suited for it. Enabling autonomous mode for an unsuitable workload can result in data corruption, data loss, or other unexpected behavior, as this mode does not strictly conform to the behavior clients typically expect from an SMB server.
Enabling Scale-Out
To enable scale-out, set the following parameter:
- Configuration file's [global] section: scale_out
- tsmb-cfg global update: Not supported
- tsmb-cfg global add and tsmb-cfg global del: Not supported
- Value Type: string
- Value Format: true|false|autonomous
  - true: Enables scale-out.
  - false: Disables scale-out.
  - autonomous: Enables scale-out, but without synchronizing the FSA state between nodes.
- Default Value: true
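For instance, an active-active configuration enables scale-out in the [global] section as shown below (other parameters elided); replace true with autonomous only for the workloads described above:

[global]
...
scale_out = true
...
[/global]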
Persistent File Handle Database
In both active-passive and active-active configurations, clients must be able to resume work on another node in the event of a failure. This capability is achieved by storing information about open file handles in shared storage accessible by all active nodes. When a client reconnects to a new node, the node can seamlessly resume the client's work because it has access to the open file handle information.
When scale-out is enabled, the file handle database is maintained in-memory on all nodes, making the configuration of the persistent file handle database optional.
Configuring the Persistent File Handle Database
To enable the persistent file handle database and set its location, use the following parameters:
- Configuration file's [global] section: ca
- tsmb-cfg global update: Not supported
- tsmb-cfg global add and tsmb-cfg global del: Not supported
- Value Type: boolean
- Value Format: true|false
  - true: Enables the persistent file handle database globally.
  - false: Disables the persistent file handle database globally.
- Default Value: false
Setting ca to true, either globally or per-share, is not the equivalent of Enabling CA in Windows Server. The main distinction is that the CA feature in Windows implies synchronous writes of all data to disk. In Fusion File Share Server, to replicate this behavior, you must also set vfs to libc:force_sync on all shares. This configuration is required for active-passive configurations.
- Configuration file's [global] section: ca_path
- tsmb-cfg global update: Not supported
- tsmb-cfg global add and tsmb-cfg global del: Not supported
- Value Type: string
- Value Format: <path>
  <path> specifies the path on shared storage where Fusion File Share Server stores its persistent file handle database. This path must be accessible by all nodes in the Fusion File Share Server cluster to support continuous or high availability. If not overridden on a per-share basis using the optional <path> portion of the share's ca_params parameter, the path of the persistent file handle database for each share with continuous availability enabled defaults to <path>/<netname> (where <netname> is the share's netname parameter).
- Default Value: none
- Examples: Setting ca_path to /mnt/shared/ca stores the persistent file handle database in /mnt/shared/ca/<share_name> for each share where continuous availability is enabled.
Example
The following example assumes you have a shared storage volume mounted at /mnt/shared/:
[global]
...
ca = true
ca_path = /mnt/shared/_ca
...
[/global]
[share]
netname = sh1
path = /mnt/shared/sh1
...
[/share]
[share]
netname = sh2
path = /mnt/shared/sh2
ca = false
...
[/share]
[share]
netname = sh3
path = /mnt/shared/sh3
ca_params = path=/mnt/shared/sh3_ca,durable
...
[/share]
The above configuration:
- Sets the global location of the persistent file handle database to /mnt/shared/_ca.
- Sets the location of the persistent file handle database for share sh1 to /mnt/shared/_ca/sh1, since the global ca parameter is set to true, and the share's ca and ca_params parameters are not set.
- Does not store the persistent file handle database for share sh2, since the share's ca parameter is explicitly set to false.
- Sets the location of the persistent file handle database for share sh3 to /mnt/shared/sh3_ca, since the share's ca_params contains a path=... component, which overrides the global ca_path parameter.
- Persists durable handles for share sh3, since its ca_params has the durable flag set.
Connection Recovery and the Connection Recovery Database
The connection recovery database stores information about ongoing TCP connections between clients and server nodes. The information in this database is essential for recovering connections in the event of a node failure. The connection recovery database is stored on shared storage, ensuring it is accessible to all nodes in the cluster.
When an active node takes over for a failed node, it reads the connection recovery database to identify which connections to resume, using the TCP Tickle ACK technique.
TCP Tickle is a mechanism used to maintain the state of an idle TCP connection. Fusion File Share Server leverages the TCP Tickle ACK technique, which involves sending an ACK packet to the client with invalid fields, such as a bogus sequence number. This triggers a series of TCP exchanges, prompting the client to recognize its connection to the failed node as stale, and reconnect to the newly active node.
To enable connection recovery with TCP tickle, configure the following parameters:
- Configuration file's [global] section: tcp_tickle
- tsmb-cfg global update: Not supported
- tsmb-cfg global add and tsmb-cfg global del: Not supported
- Value Type: boolean
- Value Format: true|false
  - true: Enables connection recovery with TCP tickle.
  - false: Disables connection recovery with TCP tickle.
- Default Value: false
- Configuration file's [global] section: tcp_tickle_params
- tsmb-cfg global update: Not supported
- tsmb-cfg global add and tsmb-cfg global del: Not supported
- Value Type: string
- Value Format: path=<path>
  <path> specifies the path to the connection recovery database on the shared storage.
- Default Value: none
- Examples: Setting tcp_tickle_params to path=/mnt/shared/cr stores the connection recovery database in the specified directory.
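Putting the two parameters together, a [global] section that enables connection recovery might include the following; /mnt/shared/cr is the same placeholder path used in the example above:

[global]
...
tcp_tickle = true
tcp_tickle_params = path=/mnt/shared/cr
...
[/global]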