You are deploying a new application cluster and wonder how it will perform under less-than-ideal conditions: heavy system load, slow storage, network performance degradation. Application resiliency testing is integral to any application architecture but is often passed over because the process is considered overly complex and time-consuming. Here are some technical suggestions to make resiliency testing a little easier.
Most applications designed to run in a high-availability clustered environment are very resilient to an outright loss of cluster components: servers crashing, disks failing, network links dropping, and so on. The traditional weak spot of such designs is partial, transient degradation in performance rather than clean failure.
For example, an application can usually handle a server crash, but an unexpected restart is another matter: while the remaining cluster nodes are still redistributing roles and workload, the missing server reboots and tries to rejoin the cluster, usually causing much confusion. Similarly, if one of the network links drops, a cluster can handle this through link aggregation, among other methods. On the other hand, intermittent network degradation, such as a bandwidth bottleneck, high latency, or dropped packets, will usually result in application performance and stability problems.
Simulating such events during application performance testing will save you a lot of weekend work down the road. The specific commands below were used on HPE DL380 servers running RHEL 8.6, but you can easily adapt the syntax for most other modern Linux flavors.
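One of the scenarios described above, the unexpected server restart, needs no special tooling at all: the kernel's magic SysRq interface can force an immediate reboot without a clean shutdown, which is a reasonable stand-in for a node dropping out and then rejoining the cluster mid-failover. This is a minimal sketch; run it only on a test node you are prepared to lose.

# Enable the magic SysRq interface if it is not already enabled
echo 1 > /proc/sys/kernel/sysrq
# Force an immediate, unclean reboot of this node
echo b > /proc/sysrq-trigger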
System Prerequisites
Install the required testing tools, load the netem kernel module, and make sure you can view and clear traffic control (tc) queueing disciplines on the primary NIC.
# Install the required testing tools (net-tools provides the route command used below)
yum -y install iproute-tc net-tools stress stress-ng bonnie++ kernel-modules-extra kernel-modules-extra-$(uname -r)

# Load the netem kernel module and print OK if it is present
modprobe sch_netem; lsmod | grep -c sch_netem | sed -e 's/1/OK/g' -e 's/0/FAIL/g'

# View all tc qdiscs for the primary NIC
tc qdisc show dev $(route | grep -m1 ^default | awk '{print $NF}')

# Clear all tc qdiscs from the primary NIC
tc qdisc del dev $(route | grep -m1 ^default | awk '{print $NF}') root 2>/dev/null
Network Testing
Introduce a 100ms delay with a random +/-10ms variation and a 25% correlation between successive packets
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem delay 100ms 10ms 25%
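To confirm the rule is working, ping another node before and after applying it and compare the round-trip times; you should see roughly 100ms of extra latency added in the egress direction. The address 192.0.2.10 below is a placeholder for one of your other cluster nodes.

ping -c 10 192.0.2.10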
Introduce a 10% packet loss
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem loss 10%
Corrupt 5% of packets by introducing a single-bit error at a random offset
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem corrupt 5%
Duplicate 1% of sent packets
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem duplicate 1%
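Netem can also reorder packets, which some heartbeat and replication protocols tolerate worse than outright loss. The following variation on the delay example, adapted from the standard netem documentation, sends 25% of packets (with 50% correlation) immediately while delaying the rest by 10ms, effectively delivering them out of order.

tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root netem delay 10ms reorder 25% 50%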
Limit egress bandwidth to 128kbit/s with a 32kbit burst buffer and a maximum queueing latency of 100ms
tc qdisc add dev $(route | grep -m1 ^default | awk '{print $NF}') root tbf rate 128kbit burst 32kbit latency 100ms
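To verify the cap, measure throughput to another node. The example below assumes iperf3 is installed on both ends (it is not part of the package list above) and that the remote node, shown here with the placeholder address 192.0.2.10, is already running iperf3 -s.

# Measure egress throughput for 10 seconds against the remote listener
iperf3 -c 192.0.2.10 -t 10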
Clear all tc qdiscs from the primary NIC
tc qdisc del dev $(route | grep -m1 ^default | awk '{print $NF}') root 2>/dev/null
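Keep in mind that all of the rules above impair every packet leaving the primary NIC, including your own SSH session. If that becomes a problem, netem can be confined to traffic toward a specific host by attaching it to one band of a prio qdisc and steering matching packets into that band with a u32 filter, as in this sketch; eth0 and 192.0.2.10 are placeholders for your interface and the target cluster node.

tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 100ms
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.0.2.10/32 flowid 1:3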
System Stress Test
These tests emulate system resource limitations caused by factors like runaway processes, hardware failures, and resource contention.
Fully utilize half of all CPU cores and half of all memory for one minute:
stress --cpu $(echo "scale=0;$(grep -c ^processor /proc/cpuinfo) / 2" | bc -l) --io 1 --vm 1 --vm-bytes $(echo "scale=0;$(grep MemTotal /proc/meminfo | awk '{print $2}') / 2" | bc -l)K --timeout 60
A variation of the previous test with additional disk I/O
stress --hdd 4 --io 6 --vm 8 --cpu $(echo "scale=0;$(grep -c ^processor /proc/cpuinfo) / 2" | bc -l) --timeout 60
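The prerequisites also pull in stress-ng, which offers finer-grained control; for instance, it can load every core at roughly 50% rather than pinning half of the cores at 100%, and it accepts memory sizes as a percentage of RAM. A rough stress-ng equivalent of the first test might look like this:

stress-ng --cpu 0 --cpu-load 50 --vm 1 --vm-bytes 50% --timeout 60s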
Test /mnt/app filesystem performance using Bonnie++
if mountpoint -q /mnt/app; then bonnie++ -n 0 -u 0 -r $(free -m | awk '/^Mem:/{print $2}') -s $(($(free -m | awk '/^Mem:/{print $2}') * 2)) -f -b -d /mnt/app; fi
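Bonnie++ gives a thorough picture, but when you only need a quick sequential-write sanity check while the cluster is under network or CPU stress, a direct-I/O dd run is usually enough; testfile below is a throwaway name on the filesystem under test.

dd if=/dev/zero of=/mnt/app/testfile bs=1M count=1024 oflag=direct; rm -f /mnt/app/testfile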