Clustering allows eXo Platform users to run portal instances on several parallel servers, which are also called nodes. The load is distributed across the servers, so the portal remains accessible through the other cluster nodes if a server fails. Adding more nodes to the cluster can also improve eXo Platform's performance. A cluster is a set of nodes that are managed together and participate in workload management. Installing eXo Platform in the cluster mode is considered in the following main cases:
Load Balancing: when a single server node is not enough to handle the load.
High Availability: if one node fails, the remaining nodes in the cluster can take over the workload of the system, so access is not interrupted.
These characteristics should be handled by the overall architecture of your system. Load Balancing is typically achieved by a front server or device that distributes requests to the cluster nodes. High Availability on the data layer is typically achieved using the native replication implemented by the Relational Database Management System (RDBMS) or by shared file systems, such as SAN and NAS.
This chapter covers only the changes that are necessary for eXo Platform to work in the cluster mode:
In eXo Platform, the persistence mostly relies on JCR, which is a middleware between the eXo Platform applications (including the Portal) and the database. Hence, this component must be configured to work in the cluster mode.
The embedded JCR server requires a portion of its state to be stored on a file system shared among the cluster nodes:
The values storage.
The index.
All nodes must have read/write access to the shared file system.
It is strongly recommended that you use a mount point on a SAN.
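The read/write requirement can be checked on each node before going further. The following is only a minimal sketch: check_shared_dir is a hypothetical helper name, not part of eXo Platform, and it simply writes, reads back, and deletes a small probe file.

```shell
#!/bin/sh
# check_shared_dir DIR
# Hypothetical helper (not shipped with eXo Platform): succeeds if DIR exists
# and the current user can create, read back, and delete a file in it.
check_shared_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir"
        return 1
    fi
    # The probe name includes the node name and PID to avoid collisions
    # when several nodes run the check at the same time.
    probe="$dir/.rw-probe-$(uname -n)-$$"
    if { echo "ok" > "$probe"; } 2>/dev/null && [ "$(cat "$probe")" = "ok" ]; then
        rm -f "$probe"
        echo "read/write OK: $dir"
        return 0
    fi
    rm -f "$probe"
    echo "no read/write access: $dir"
    return 1
}
```

Run it on every node as the user that starts eXo Platform, for example check_shared_dir /PATH/TO/SHARED/FS.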
To set up the cluster in eXo Platform:
1. Switch to a cluster configuration.
This step is done in the configuration.properties file, which must be set in the same way on all the cluster nodes.
First, point the exo.shared.dir variable to a network directory shared between cluster nodes.
exo.shared.dir=/PATH/TO/SHARED/FS
Because the path is shared, all nodes need read/write access to it.
Then, switch the JCR to the cluster mode.
gatein.jcr.config.type=cluster
With this setting, the JCR enables automatic network replication and discovery among the cluster nodes.
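Since the file must match on all nodes, it can help to verify the two settings above on each node. This is a rough sketch only: get_prop and check_cluster_props are hypothetical helper names, and the parsing handles plain key=value lines, not every feature of the Java properties format.

```shell
#!/bin/sh
# get_prop FILE KEY
# Print the value of KEY from a key=value properties file (last match wins).
get_prop() {
    prop_file="$1"
    # Escape the dots in the key so they match literally in the regex.
    key_re=$(printf '%s' "$2" | sed 's/\./\\./g')
    sed -n "s/^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" \
        "$prop_file" | tail -n 1
}

# check_cluster_props FILE
# Succeed only if the shared directory is set and the JCR type is "cluster".
check_cluster_props() {
    conf="$1"
    shared=$(get_prop "$conf" exo.shared.dir)
    jcr_type=$(get_prop "$conf" gatein.jcr.config.type)
    if [ -z "$shared" ]; then
        echo "exo.shared.dir is not set"
        return 1
    fi
    if [ "$jcr_type" != "cluster" ]; then
        echo "gatein.jcr.config.type is '$jcr_type', expected 'cluster'"
        return 1
    fi
    echo "cluster settings OK (shared dir: $shared)"
}
```

For example, check_cluster_props /path/to/configuration.properties, where the path depends on your installation.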
2. Switch to the cluster profile. You need to indicate the cluster kernel profile to eXo Platform. This can be done by editing gatein.sh as below:
EXO_PROFILES="-Dexo.profiles=default,cluster"
Alternatively, pass the profiles to the start_eXo script:
./start_eXo.sh default,cluster
3. Do the initial startup.
For the initial startup of your JCR cluster, you should start only a single node. This node will initialize the internal JCR database and create the system workspace. Once this first node has fully started, you can start the other nodes.
This constraint applies only to the initial startup. After the first node has fully started on its own, the other nodes can be started in any order or in parallel.
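The "wait for the first node, then start the rest" rule can be scripted. The sketch below is one possible approach and makes assumptions: wait_until and http_ok are hypothetical helper names, curl is assumed to be installed, and an HTTP answer from the portal is taken as a rough signal that the first node has finished starting.

```shell
#!/bin/sh
# wait_until LIMIT INTERVAL COMMAND [ARGS...]
# Run COMMAND repeatedly until it succeeds or LIMIT seconds have elapsed.
wait_until() {
    limit="$1"
    interval="$2"
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# http_ok URL
# Probe: succeeds once URL answers with a non-error HTTP status.
http_ok() {
    curl -s -f -o /dev/null --max-time 5 "$1"
}
```

A possible use, with a placeholder URL for the first node: wait_until 600 10 http_ok http://node1:8080/portal && echo "first node is up, start the other nodes".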
4. Start up and shut down.
Nodes of the cluster will automatically try to join others at startup. Once they have discovered each other, they will synchronize their state. During the synchronization, the node is not ready to serve requests.
The cluster mode is preconfigured to work out of the box. eXo Platform clustering relies fully on JBossCache replication, which uses JGroups internally.
The default configuration of JBossCache lies in exo.portal.component.common-x.x.x.jar.
Since eXo Platform 3.5-M3, the JCR's JBossCache default configuration is externalized to the $CATALINA_HOME/gatein/conf/jcr/jbosscache folder.
The advanced configuration is optional. It is recommended that you keep the default configuration.
The JBossCache configuration is done in the configuration.properties file.
On Tomcat:
# JBossCache configuration
gatein.jcr.jbosscache.config=file:${catalina.home}/${exo.conf.dir.name}/jcr/jbosscache
# JCR cache configuration
gatein.jcr.cache.config=${gatein.jcr.jbosscache.config}/${gatein.jcr.config.type}/cache-config.xml
gatein.jcr.cache.expiration.time=15m
# JCR Locks configuration
gatein.jcr.lock.cache.config=${gatein.jcr.jbosscache.config}/${gatein.jcr.config.type}/lock-config.xml
# JCR Index configuration
gatein.jcr.index.cache.config=${gatein.jcr.jbosscache.config}/${gatein.jcr.config.type}/indexer-config.xml
# JGroups configuration
gatein.jcr.jgroups.config=${gatein.jcr.jbosscache.config}/cluster/jgroups-udp.xml
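To see where these properties point on a Tomcat node, the files can be checked directly. The sketch below assumes that exo.conf.dir.name resolves to gatein/conf (consistent with the externalized folder mentioned above) and that gatein.jcr.config.type is cluster; check_jbosscache_files is a hypothetical helper name.

```shell
#!/bin/sh
# check_jbosscache_files CATALINA_HOME
# Report whether each configuration file referenced by the Tomcat properties
# exists, assuming the gatein/conf layout and the "cluster" config type.
check_jbosscache_files() {
    base="$1/gatein/conf/jcr/jbosscache/cluster"
    missing=0
    for f in cache-config.xml lock-config.xml indexer-config.xml jgroups-udp.xml; do
        if [ -f "$base/$f" ]; then
            echo "found:   $base/$f"
        else
            echo "missing: $base/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_jbosscache_files "$CATALINA_HOME" returns a non-zero status if any of the four files is absent.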
On JBoss Application Server:
# JCR cache configuration
gatein.jcr.cache.config=classpath:/conf/jcr/jbosscache/${gatein.jcr.config.type}/cache-config.xml
gatein.jcr.cache.expiration.time=15m
# JCR Locks configuration
gatein.jcr.lock.cache.config=classpath:/conf/jcr/jbosscache/${gatein.jcr.config.type}/lock-config.xml
# JCR Index configuration
gatein.jcr.index.cache.config=classpath:/conf/jcr/jbosscache/${gatein.jcr.config.type}/indexer-config.xml
# JGroups configuration
gatein.jcr.jgroups.config=classpath:/conf/jcr/jbosscache/cluster/jgroups-udp.xml
By default, node discovery is based on UDP: JGroups identifies the nodes through the UDP transport. The administrator can change the detection settings and ports in jgroups-udp.xml.
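Before changing anything, it can be useful to see which multicast address and port the file currently uses. The sketch below assumes the file sets the mcast_addr and mcast_port attributes of the JGroups UDP transport; show_mcast_settings is a hypothetical helper name, and the output is whatever the file contains, including any property placeholders.

```shell
#!/bin/sh
# show_mcast_settings FILE
# Print the mcast_addr and mcast_port attributes found in a JGroups XML file,
# one per line, in the order they appear.
show_mcast_settings() {
    grep -E -o 'mcast_(addr|port)="[^"]*"' "$1"
}
```

For example, show_mcast_settings "$CATALINA_HOME/gatein/conf/jcr/jbosscache/cluster/jgroups-udp.xml" on a Tomcat node, if your layout matches the externalized folder described above.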
Q1. How do I migrate from the local mode to the cluster mode?
To migrate your production system from the local (non-cluster) mode to the cluster mode, follow these steps:
1. On your main server, update the configuration to the cluster mode as explained above.
2. Use the same configuration on the other cluster nodes.
3. Move the index and the values storage to the shared file system.
4. Start the cluster.
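Step 3 can be done with ordinary file tools while the server is stopped. The sketch below is only an illustration: copy_tree is a hypothetical helper name, the source and target directories depend on your installation and are not specified here, and the helper copies rather than moves so that the original data stays in place until the cluster has been verified.

```shell
#!/bin/sh
# copy_tree SRC DST
# Copy the contents of SRC into DST (expected to be new or empty), preserving
# permissions and timestamps, then compare the number of files on both sides.
copy_tree() {
    src="$1"
    dst="$2"
    if [ ! -d "$src" ]; then
        echo "missing source: $src"
        return 1
    fi
    mkdir -p "$dst" || return 1
    cp -Rp "$src"/. "$dst"/ || return 1
    src_count=$(find "$src" -type f | wc -l | tr -d ' ')
    dst_count=$(find "$dst" -type f | wc -l | tr -d ' ')
    if [ "$src_count" -ne "$dst_count" ]; then
        echo "file count differs: $src_count in source, $dst_count in target"
        return 1
    fi
    echo "copied $src_count files to $dst"
}
```

Call it once for the index directory and once for the values storage directory, each with its own target under the shared file system.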
Q2. Why does startup fail with the "Port value out of range" error?
On Linux, startup may fail with the following error:
[INFO] Caused by: java.lang.IllegalArgumentException: Port value out of range: 65536
This problem happens under specific circumstances when JGroups, the networking library behind the clustering, attempts to detect the IP address to use for communication with other nodes.
You need to verify:
The host name resolves to a valid IP address served by one of the network devices, such as eth0 or eth1.
The host name is NOT mapped to localhost or 127.0.0.1.
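Both points can be checked from a shell on the node. This is a minimal sketch that assumes the getent command is available, as it is on most Linux distributions; is_loopback and check_host_ip are hypothetical helper names.

```shell
#!/bin/sh
# is_loopback IP
# Succeed if IP is a loopback address (127.x.x.x or ::1).
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}

# check_host_ip [HOST]
# Resolve HOST (default: this machine's node name) and fail if it does not
# resolve at all or resolves to a loopback address.
check_host_ip() {
    host="${1:-$(uname -n)}"
    # getent prints "ADDRESS NAME [ALIASES]"; keep the first address only.
    ip=$(getent hosts "$host" | awk '{ print $1; exit }')
    if [ -z "$ip" ]; then
        echo "$host does not resolve to any address"
        return 1
    fi
    if is_loopback "$ip"; then
        echo "$host resolves to loopback address $ip"
        return 1
    fi
    echo "$host resolves to $ip"
}
```

Running check_host_ip without arguments checks the current node. If it reports a loopback address, the host name mapping (for example in /etc/hosts) is the first place to look.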
Q3. How do I resolve the "failed sending message to null" error?
You may encounter the following error when starting up in the cluster mode on Linux:
Dec 15, 2010 6:11:31 PM org.jgroups.protocols.TP down
SEVERE: failed sending message to null (44 bytes)
java.lang.Exception: dest=/228.10.10.10:45588 (47 bytes)
Clustering on Linux only works with IPv4. Therefore, when using a cluster under Linux, add the following JVM system property:
-Djava.net.preferIPv4Stack=true
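Where the property goes depends on how the server is started. The sketch below assumes that your startup script passes the JAVA_OPTS environment variable to the JVM, which should be verified for your installation; add_jvm_opt is a hypothetical helper name that appends the option only if it is not already present.

```shell
#!/bin/sh
# add_jvm_opt OPTION
# Append OPTION to JAVA_OPTS unless it is already there.
add_jvm_opt() {
    case " ${JAVA_OPTS:-} " in
        *" $1 "*) ;;   # already present, nothing to do
        *) JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }$1" ;;
    esac
}

# Assumption: the startup script reads JAVA_OPTS from the environment.
add_jvm_opt "-Djava.net.preferIPv4Stack=true"
export JAVA_OPTS
```

Run or source this in the shell that launches the server, before the start script is called.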