You are deploying a new application cluster and wonder how it will perform under less-than-ideal conditions: heavy system load, slow storage, network performance degradation. Application resiliency testing is integral to any application architecture but is often passed over because the process is considered overly complex and time-consuming. Here are some technical suggestions to make resiliency testing a little easier.
Most applications designed to run in a high-availability clustered environment are very resilient to an outright loss of cluster components: servers crashing, disks failing, network links dropping, and so on. The traditional weak spot of such designs is partial, transient degradation in performance rather than clean failure.
For example, an application can usually handle a server crash, but an unexpected restart is another matter: while the remaining cluster nodes are still redistributing roles and workload, the missing server reboots and tries to rejoin the cluster, usually causing much confusion. Similarly, if one of the network links drops, a cluster can handle this through link aggregation, among other methods. On the other hand, intermittent network degradation, such as a bandwidth bottleneck, high latency, or dropped packets, will usually result in application performance and stability problems.
Simulating such events during application performance testing will save you a lot of weekend work down the road. The specific commands below were used on HPE DL380 servers running RHEL 8.6, but you can easily adapt the syntax for most other modern Linux flavors.
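One of the scenarios described above, the unexpected server restart, needs no special tooling at all: the kernel's magic SysRq interface can force an immediate reboot without a clean shutdown, which is a reasonable stand-in for a node dropping out and then rejoining the cluster mid-failover. This is a minimal sketch; run it only on a test node you are prepared to lose.

# Enable the magic SysRq interface if it is not already enabled
echo 1 > /proc/sys/kernel/sysrq
# Force an immediate, unclean reboot of this node
echo b > /proc/sysrq-trigger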
System Prerequisites
Install the required testing tools, load the netem kernel module, and make sure you can view and clear traffic control (tc) queueing disciplines on the primary NIC.
# Install the required testing tools (net-tools provides the route command used below)
yum -y install iproute-tc net-tools stress stress-ng bonnie++ kernel-modules-extra kernel-modules-extra-$(uname -r)

# Load the netem kernel module and print OK if it is present
modprobe sch_netem; lsmod | grep -c sch_netem | sed -e 's/1/OK/g' -e 's/0/FAIL/g'

# View all tc qdiscs for the primary NIC
tc qdisc show dev $(route | grep -m1 ^default | awk '{print $NF}')

# Clear all tc qdiscs from the primary NIC
tc qdisc del dev $(route | grep -m1 ^default | awk '{print $NF}') root 2>/dev/null
Network Testing
Introduce a 100ms delay with a random +/-10ms variation and a 25% correlation between successive packets
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem delay 100ms 10ms 25%
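To confirm the rule is working, ping another node before and after applying it and compare the round-trip times; you should see roughly 100ms of extra latency added in the egress direction. The address 192.0.2.10 below is a placeholder for one of your other cluster nodes.

ping -c 10 192.0.2.10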
Introduce a 10% packet loss
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem loss 10%
Corrupt 5% of packets by introducing a single-bit error at a random offset
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem corrupt 5%
Duplicate 1% of sent packets
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem duplicate 1%
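Netem can also reorder packets, which some heartbeat and replication protocols tolerate worse than outright loss. The following variation on the delay example, adapted from the standard netem documentation, sends 25% of packets (with 50% correlation) immediately while delaying the rest by 10ms, effectively delivering them out of order.

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem delay 10ms reorder 25% 50%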
Limit egress bandwidth to 128kbit/s with a 32kbit burst buffer and a maximum queueing latency of 100ms
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root tbf rate 128kbit burst 32kbit latency 100ms
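To verify the cap, measure throughput to another node. The example below assumes iperf3 is installed on both ends (it is not part of the package list above) and that the remote node, shown here with the placeholder address 192.0.2.10, is already running iperf3 -s.

# Measure egress throughput for 10 seconds against the remote listener
iperf3 -c 192.0.2.10 -t 10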
Clear all tc qdiscs from the primary NIC
tc qdisc del dev $(route | grep -m1 ^default | awk '{print $NF}') root 2>/dev/null
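Keep in mind that all of the rules above impair every packet leaving the primary NIC, including your own SSH session. If that becomes a problem, netem can be confined to traffic toward a specific host by attaching it to one band of a prio qdisc and steering matching packets into that band with a u32 filter, as in this sketch; eth0 and 192.0.2.10 are placeholders for your interface and the target cluster node.

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 100ms
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.0.2.10/32 flowid 1:3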
System Stress Test
These tests emulate system resource limitations caused by factors like runaway processes, hardware failures, and resource contention.
Fully utilize half of all CPU cores and half of all memory for one minute:
stress --cpu $(echo "scale=0;$(grep -c ^processor /proc/cpuinfo) / 2" | bc -l) --io 1 --vm 1 --vm-bytes $(echo "scale=0;$(grep MemTotal /proc/meminfo | awk '{print $2}') / 2" | bc -l)K --timeout 60
A variation of the previous test with additional disk I/O
stress --hdd 4 --io 6 --vm 8 --cpu $(echo "scale=0;$(grep -c ^processor /proc/cpuinfo) / 2" | bc -l) --timeout 60
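The prerequisites also pull in stress-ng, which offers finer-grained control; for instance, it can load every core at roughly 50% rather than pinning half of the cores at 100%, and it accepts memory sizes as a percentage of RAM. A rough stress-ng equivalent of the first test might look like this:

stress-ng --cpu 0 --cpu-load 50 --vm 1 --vm-bytes 50% --timeout 60s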
Test /mnt/app filesystem performance using Bonnie++
if mountpoint -q /mnt/app; then bonnie++ -n 0 -u 0 -r $(free -m | awk '/^Mem:/{print $2}') -s $(($(free -m | awk '/^Mem:/{print $2}') * 2)) -f -b -d /mnt/app; fi
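Bonnie++ gives a thorough picture, but when you only need a quick sequential-write sanity check while the cluster is under network or CPU stress, a direct-I/O dd run is usually enough; testfile below is a throwaway name on the filesystem under test.

dd if=/dev/zero of=/mnt/app/testfile bs=1M count=1024 oflag=direct; rm -f /mnt/app/testfile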