Connection Failover and Recovery

You can set up transparent failover and continuous availability (CA) for Fusion File Share in active-passive mode using the instructions below. Fusion File Share stores connection state and open handles (durable and persistent) in the file system, so clients can reconnect and recover their handles seamlessly after a failover. For the per-share configuration, also refer to the section Continuous Availability (CA).

CA requires both the global settings described here and the per-share settings.

To set up active-passive mode, you need at least two nodes: a primary (active) node and a backup (passive) node. If the primary node disappears from the network, for example because its network subsystem was shut down, the node was hard-reset, or it hit a kernel panic, the backup node starts Fusion File Share. Fusion File Share on the backup node (now the active node) reads the connection recovery DB, which is stored on a persistent volume shared between the cluster nodes, and sends TCP tickle ACKs to each client to trigger a TCP reconnect. It is important that Fusion File Share is started on the backup node only after the floating IP has been assigned to the backup node.
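The ordering requirement above (floating IP first, then the server) can be enforced in whatever takeover hook your cluster manager runs on the backup node. The following is a minimal Python sketch, not part of Fusion File Share: `takeover`, `assign_ip`, `local_addresses` and `start_server` are hypothetical names, and the callbacks stand in for the cluster-manager and service-manager actions of your environment.

```python
import ipaddress
from typing import Callable, Iterable


def takeover(
    floating_ip: str,
    assign_ip: Callable[[str], None],
    local_addresses: Callable[[], Iterable[str]],
    start_server: Callable[[], None],
) -> None:
    """Hypothetical takeover hook for the backup node.

    Assigns the floating IP, verifies it is really present on this node,
    and only then starts the file server.
    """
    ip = str(ipaddress.ip_address(floating_ip))  # validates and normalizes
    assign_ip(ip)
    # Compare normalized forms so that e.g. IPv6 spellings match.
    present = {str(ipaddress.ip_address(a)) for a in local_addresses()}
    if ip not in present:
        raise RuntimeError(f"floating IP {ip} not assigned; refusing to start server")
    start_server()
```

Failing loudly when the address is missing prevents the server from coming up on the backup node without the floating IP, which is the situation this section warns against.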

You can also use CA on a single node. This is useful for software updates: you can upgrade Fusion File Share to a newer version on the same node without losing client handles. Shut down the Fusion File Share server and start the newer version on the same node. As long as the failover and CA settings are configured correctly, the update should be transparent to connected clients.

Connection failover and recovery depend on the TCP tickle ACK mechanism, so you must set tcp_tickle to true.

tcp_tickle = [true | false]

This feature allows an SMB client to reconnect automatically to a server if the server node it was connected to has failed.

If the node the client is connected to fails, terminates or shuts down without any prior indication, the client detects that the server is unavailable only when it hits a timeout or through some form of keep-alive mechanism. Detection of this kind is slow and unreliable, and its details vary from one application to another.

The TCP tickle ACK mechanism sends an ACK with invalid fields, which triggers a short TCP exchange that makes the client promptly recognize the stale connection and reconnect.
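To make "an ACK with invalid fields" concrete, here is a minimal Python sketch of one common way to build such a segment. It is an illustration under assumptions, not Fusion File Share's implementation: it uses sequence and acknowledgment numbers of 0, which are almost certainly outside the client's window. The client answers with an ACK carrying its real sequence numbers; the node that now holds the floating IP has no state for that connection, so its TCP stack replies with a reset, and the client reconnects. The sketch only builds the 20-byte TCP header (source = floating IP and SMB port 445, destination = client); actually sending it would need a raw socket and root privileges. `build_tickle_ack`, `checksum16` and the window value are illustrative choices.

```python
import socket
import struct


def checksum16(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_tickle_ack(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> bytes:
    """Build a TCP header (no payload) with only the ACK flag, seq=0 and ack=0."""
    seq, ack = 0, 0
    offset_flags = (5 << 12) | 0x10  # data offset = 5 32-bit words, ACK flag only
    window, urgent = 1234, 0  # arbitrary window; urgent pointer unused
    header = struct.pack(
        "!HHIIHHHH", src_port, dst_port, seq, ack, offset_flags, window, 0, urgent
    )
    # IPv4 pseudo-header: src, dst, zero byte, protocol, TCP segment length.
    pseudo = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(header))
    )
    csum = checksum16(pseudo + header)
    # The checksum field occupies bytes 16-17 of the TCP header.
    return header[:16] + struct.pack("!H", csum) + header[18:]


# Example: tickle a client at 198.51.100.7:50123 from floating IP 192.0.2.10.
segment = build_tickle_ack("192.0.2.10", 445, "198.51.100.7", 50123)
```
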

A boolean; if set to true, enables the TCP tickle mechanism.

By default:

tcp_tickle = false

Connection Recovery DB

In an active-passive cluster with a shared volume, the tickle mechanism stores connection entries in a database on that shared volume.
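The on-disk format of the connection recovery DB is not described here, so the following Python sketch is purely hypothetical and only illustrates the idea: the active node records each client connection by its TCP 4-tuple, and after takeover the backup node enumerates the entries to know which clients to tickle. It assumes one JSON file per connection in the shared directory, written to a temporary file and then renamed with `os.replace`, so that a crash mid-write never leaves a half-written entry visible to the backup node. `ConnEntry`, `record`, `forget` and `load_all` are illustrative names.

```python
import json
import os
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ConnEntry:
    """One client connection, identified by its TCP 4-tuple (hypothetical schema)."""
    server_ip: str
    server_port: int
    client_ip: str
    client_port: int

    def key(self) -> str:
        return f"{self.server_ip}_{self.server_port}_{self.client_ip}_{self.client_port}"


def record(db_dir: str, entry: ConnEntry) -> None:
    """Active node: persist an entry via write-to-temp, fsync, atomic rename."""
    final = os.path.join(db_dir, entry.key() + ".json")
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        json.dump(asdict(entry), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)


def forget(db_dir: str, entry: ConnEntry) -> None:
    """Active node: remove the entry when the client closes the connection."""
    try:
        os.remove(os.path.join(db_dir, entry.key() + ".json"))
    except FileNotFoundError:
        pass


def load_all(db_dir: str) -> List[ConnEntry]:
    """Backup node after takeover: each entry is a client to send a tickle ACK to."""
    entries = []
    for name in sorted(os.listdir(db_dir)):
        if name.endswith(".json"):  # skips leftover *.json.tmp files
            with open(os.path.join(db_dir, name)) as f:
                entries.append(ConnEntry(**json.load(f)))
    return entries
```
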

This is configured with the option ‘tcp_tickle_params’, which accepts a list of parameters separated by commas (‘,’).

The supported parameters are:

path – Path to a dedicated directory on a volume shared between the failover cluster nodes. Example configuration:

tcp_tickle_params = path=</path/to/directory>
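If you generate or validate this setting with your own tooling, a minimal Python sketch of parsing the value might look like the following. It is an assumption-laden illustration, not Fusion File Share's parser: `parse_tickle_params` and `KNOWN_PARAMS` are hypothetical names, only `path` is accepted because it is the only parameter documented here, and whitespace around keys and values is assumed to be insignificant.

```python
from typing import Dict

# Assumption: `path` is the only parameter documented for tcp_tickle_params.
KNOWN_PARAMS = {"path"}


def parse_tickle_params(value: str) -> Dict[str, str]:
    """Parse a comma-separated list of key=value pairs, e.g. 'path=/mnt/shared/ffs-tickle'."""
    params: Dict[str, str] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        key, sep, val = item.partition("=")  # split on the first '=' only
        key, val = key.strip(), val.strip()
        if not sep or not key or not val:
            raise ValueError(f"malformed parameter: {item!r}")
        if key not in KNOWN_PARAMS:
            raise ValueError(f"unknown parameter: {key!r}")
        params[key] = val
    return params
```

Because the separator is a comma, a path that itself contains a comma cannot be expressed in this simple scheme; choosing a dedicated directory name without commas avoids the problem.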