Originally published November 26, 2020 @ 1:25 pm
Over the years I’ve been using the Salt CLI extensively for day-to-day system administration tasks and, in my opinion, there’s nothing quite like it: faster, more capable, and easier to learn than the rest of the CAPS (Chef, Ansible, Puppet).
My interest in SaltStack generally is confined to its remote execution capabilities. As a sysadmin with an HPC cluster background, I know the challenges of managing a large number of dissimilar servers. Using old tools like pdsh is an option, but even the simplest of tasks – such as running a basic command on all servers – immediately presents the first challenge: keeping the list of servers up to date.
Salt’s advanced targeting, remote command, and script execution capabilities are very attractive, even if you have no interest in using Salt for server deployment and centralized configuration management. Much of what you will find on this site will be limited to remote execution and monitoring tasks, and I will not get much into the Puppet-vs-Salt-vs-Ansible-vs-Chef debate.
If you support a lot of servers, the trivial effort it takes to set up a Salt master server and to deploy the agents is well justified. Similar to Ansible, Salt can work without the agents, via SSH, but why deny yourself the convenience of near-instantaneous response from thousands of managed systems? And Salt agents work on Windows as well, if you’re into that sort of thing.
Unlike agentless server provisioning tools, with Salt, you don’t need to maintain lists and complex group hierarchies of managed nodes. You can target the systems you need on the fly using a myriad of different options you get with Salt. None of that Ansible slowness while “gathering facts”. Of course, maybe if you get paid by the hour…
In an attempt to be concise, I will go straight to troubleshooting the four most common problems I’ve encountered running Salt in a large environment (and none of these problems are really with Salt itself).
- Salt agents maintain a runtime cache in /var/cache/salt. The /var filesystem tends to run out of space because that’s where /var/log is. When this happens, the salt-minion stops working. I wish it had an option of designating a secondary cache location. For now, the solution is to use salt-ssh to access the problem nodes and clean up /var. I mean, you would have to do this anyway. And, by the way, pretty much all of the salt .* cmd.run ... commands you will see below also work as salt-ssh .* cmd.run ..., as long as you have your passwordless SSH configured the same way you would with, say, Ansible.
- From time to time virtualization clusters lose access to storage. Why? It’s complicated and right now not really important. What is important, however, is that when this happens, local filesystems tend to become read-only. This may include /var. I am sure that by now you can guess how this affects the Salt agent.
- Some people have a bad habit of cloning VMs instead of using whatever VM deployment process they should’ve been using. When they clone VMs, they invariably forget to update the /etc/salt/minion_id file, which usually contains the node’s FQDN. This, understandably, causes some confusion. Once again, you can use salt-ssh to automate a quick process that will periodically scan your environment to identify and fix this issue: just delete the minion_id file and bounce the salt-minion service (there’s a quick sketch after this list). And then you can deal with the real culprits personally.
- Finally, and I should’ve mentioned this first, make sure your firewall rules are in place to work with Salt. I can’t tell you how many times this has happened: a new VLAN is created, standard firewall rules not implemented. The minions talk to the master via an encrypted ZMQ connection on ports 4505 and 4506, so these need to be opened from minions to the master (or to the Salt proxies if you use those). Additionally, port 22 should be opened from the master to the minions if you plan on using salt-ssh (and you should have that option).
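A minimal sketch of that minion_id cleanup and of the firewall rule on the master follows. The salt-ssh target pattern, the service name, and the use of firewalld are assumptions here, so adjust them for your environment:

salt-ssh 'cloned-vm-*' cmd.run "rm -f /etc/salt/minion_id && systemctl restart salt-minion"
firewall-cmd --permanent --add-port=4505-4506/tcp && firewall-cmd --reload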
I have to admit that, when it comes to scripting, I have a soft spot for the convoluted. Here’s a characteristic example:
Imagine you have four environments – Dev, QA, UAT, and Prod – and you need to compare the CPU utilization of all Tomcat servers by the environment. Let’s also imagine that your host-naming convention looks something like this: devl-tomcat-01
or qal-tomcat-01
, where “l” after the environment abbreviation stands for “Linux”. This is just to help you understand the mess below.
j=tomcat ; for i in prodl devl uatl qal ; do echo "" ; echo "CPU utilization summary for ${i}*${j}* nodes" ; salt --timeout=30 --output=txt "${i}*${j}*" cmd.run "top -b -n 1" | egrep ': Cpu' | printf "%s %s %s %s %s %s %s %s\n" `grep -oE '[0-9]{1,}\.[0-9]{1,2}'` | awk '{ c++ ; sum_us += $1 ; sum_sy += $2 ; sum_ni += $3 ; sum_id += $4 ; sum_wa += $5 ; sum_hi += $6 ; sum_si += $7 ; sum_st += $8 ; a_us[c] = $1 ; a_sy[c] = $2 ; a_ni[c] = $3 ; a_id[c] = $4 ; a_wa[c] = $5 ; a_hi[c] = $6 ; a_si[c] = $7 ; a_st[c] = $8 } END { c=asort(a_us) ; d=asort(a_sy) ; e=asort(a_ni) ; f=asort(a_id) ; g=asort(a_wa) ; h=asort(a_hi) ; k=asort(a_si) ; l=asort(a_st) ; printf "avg_us\tavg_sy\tavg_ni\tavg_id\tavg_wa\tavg_hi\tavg_si\tavg_st\tmax_us\tmax_sy\tmax_ni\tmax_id\tmax_wa\tmax_hi\tmax_si\tmax_st\n" ; printf ( "%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\t%.0f%\n", sum_us/c, sum_sy/c, sum_ni/c, sum_id/c, sum_wa/c, sum_hi/c, sum_si/c, sum_st/c, a_us[c], a_sy[d], a_ni[e], a_id[f], a_wa[g], a_hi[h], a_si[k], a_st[l] ) }' | column -t ; echo "" ; done

CPU utilization summary for prodl*tomcat* nodes
avg_us  avg_sy  avg_ni  avg_id  avg_wa  avg_hi  avg_si  avg_st  max_us  max_sy  max_ni  max_id  max_wa  max_hi  max_si  max_st
2%      0%      0%      97%     0%      0%      0%      0%      33%     1%      1%      99%     2%      0%      0%      0%

CPU utilization summary for devl*tomcat* nodes
avg_us  avg_sy  avg_ni  avg_id  avg_wa  avg_hi  avg_si  avg_st  max_us  max_sy  max_ni  max_id  max_wa  max_hi  max_si  max_st
3%      0%      1%      96%     0%      0%      0%      0%      18%     1%      2%      100%    1%      0%      0%      0%

CPU utilization summary for uatl*tomcat* nodes
avg_us  avg_sy  avg_ni  avg_id  avg_wa  avg_hi  avg_si  avg_st  max_us  max_sy  max_ni  max_id  max_wa  max_hi  max_si  max_st
1%      0%      1%      97%     0%      0%      0%      0%      3%      1%      2%      100%    1%      0%      0%      0%

CPU utilization summary for qal*tomcat* nodes
avg_us  avg_sy  avg_ni  avg_id  avg_wa  avg_hi  avg_si  avg_st  max_us  max_sy  max_ni  max_id  max_wa  max_hi  max_si  max_st
2%      0%      1%      96%     0%      0%      0%      0%      14%     0%      1%      99%     1%      0%      0%      0%
The simple truth of what I do with Salt becomes evident. From everything above, this is the only Salt command:
salt --timeout=30 --output=txt "${i}*${j}*" cmd.run "top -b -n 1"
That’s it. The rest is just shell scripting. Having said that, accomplishing the same task with pdsh
would have been quite a bit more complicated.
So why don’t I use Salt to the fullest of its abilities? To make a short story long, back in the earlier days of HPC clusters I used Scali, which became Platform Manager in 2007 and, five years later, was acquired by IBM. Scali was an interesting but unstable and poorly-documented tool. It had excellent core functionality, but this advantage was entirely undermined by many buggy features of dubious value.
I am certain I’ve spent more time fixing issues with Scali itself than it would have taken me to deploy and manage my HPC clusters using Clonezilla and pdsh. And I would have done exactly that, had it not been for my management’s determination to continue using Scali/PM/whatever, since they had already spent the money on licensing.
I can deploy servers faster with scripts and FTP than most DevOps folks can with Puppet and Ansible. I can provide much more responsive and flexible configuration management using pdsh
and flat files than most automation guys can with SaltStack or Chef. So, if I can do all this right now knowing what I already know, why bother with anything else, unless there’s some clear advantage?
There are many DevOps engineers out there who have never heard of Scali or even pdsh
. They truly believe they’re onto something new here with their Ansible, Jenkins, OpenShift, and endless layers of virtualization. As a Russian saying goes, everything new is just well-forgotten old. Having said that, I do recognize a superior tool when I see one and so here we go.
Basic Operations
Run a command on QA nodes
salt --output=txt --timeout=30 "qa*" cmd.run "service ntpd status 2>/dev/null" 2>/dev/null
Identify QA nodes with local filesystem utilization above 95%
salt --output=txt --timeout=30 "qa*" cmd.run "df -hPl | egrep -E '9[5-9]%|100%'" 2>/dev/null | column -t
Identify QA nodes that haven’t been rebooted in the past week
salt --output=txt --timeout=30 "qa*" cmd.run "uptime" | grep -E " ([1-9]{2,}|[8-9]) days" | awk -F, '{print $1}' | column -t
Identify QA nodes with security advisories (RHEL)
salt --output=txt --timeout=30 "qa*" cmd.run "yum updateinfo summary 2>/dev/null" | grep Security | column -t
Get RHEL version of QA nodes
salt --output=txt --timeout=30 "qa*" cmd.run "grep -oE '[0-9]{1,}\.[0-9]{1,}' /etc/redhat-release 2>/dev/null" | column -t
Running commands as another user on QA Tomcat servers
salt --timeout=30 --output=txt "qa-tomcat*" cmd.run "su - tomcat bash -c 'whoami'"
Get a list of physical servers and their hardware models, sorted by generation (HPE)
salt --timeout=30 --output=txt -G "virtual:physical" cmd.run "dmidecode 2>/dev/null | grep -m1 'Product Name:'" 2>/dev/null | awk -F: '{print $1": "$NF}' | sed 's/Gen/G/g' | sort -k4 | column -t
Get a list of unique subnets used by Salt minions
salt --output=txt "qa*" cmd.run "ifconfig 2>/dev/null" 2>/dev/null | grep -oE "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" | grep -E "(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" | sort -u
Advanced Operations
Check for Puppet errors on QA nodes
salt --timeout=30 --output=txt "qa*" cmd.run "grep -c 'Puppet::Error' /var/log/messages" | grep [1-9]$ | column -t
Get the total number of CPU cores across all 32-bit nodes
salt --timeout=30 --output=txt -G "osarch:i686" cmd.run "cat /proc/cpuinfo" | cut -d ' ' -f2- | grep -c ^processor | awk '{ SUM += $1} END { print ( SUM )" cores" }'
Compound matching, count VMs:
salt --output=txt -C "* and not G@virtual:physical" test.ping 2>/dev/null | wc -l
Get the total LVM size across all DEV WebLogic nodes
salt --timeout=30 --output=txt "dev*logic*" cmd.run "vgs --units=k 2>/dev/null" | cut -d ' ' -f3- | awk '{print $6}' | grep -oE "[0-9]{1,100}\.[0-9]{2}" | awk '{ SUM += $1} END { print ( SUM/1024/1024 )" GB" }'
Show total memory allocated to all DEV Tomcat nodes
salt --timeout=30 --output=txt "dev*tomcat*" cmd.run "free -k | grep ^Mem:" 2>/dev/null | cut -d ' ' -f3- | awk '{ SUM += $1} END { print ( SUM/1024/1024 )" GB" }'
See which DEV nodes are favored by a particular user
salt --timeout=30 --output=txt "dev*" cmd.run "last jdoe | grep -c ^jdoe" 2>/dev/null | sort -k2 -rn | head -10 | column -t
Find RHEL 6 nodes with the highest swap utilization
salt --timeout=30 --output=txt -G "osfinger:Red*6" cmd.run "free -k | grep ^Swap:" | awk '{print $1"\t"$4}' | grep -vE "0{1}$" 2>/dev/null | sort -k2 -rn | column -t
Salt understands Perl-compatible regular expressions (the -E / --pcre option)
salt --timeout=30 --output=txt -E 'qa-web.*(0?[0-9]{1,2}).domain.*' cmd.run "uptime" 2>/dev/null
Salt can read a list of nodes from a file
salt --output=txt --timeout=30 -L "`cat /var/tmp/one-per-line-fqdns`" cmd.run "grep -m1 release /etc/issue 2>/dev/null" 2>/dev/null
Similar to above, but the hostnames are not FQDN
salt --output=txt --timeout=30 -L "`sed -e 's/$/\.domain\.local/' /tmp/one-per-line-short-hostnames`" cmd.run "grep release /etc/issue 2>/dev/null" 2>/dev/null
Salt can read a list of nodes from CLI
salt --output=txt --timeout=30 -L qa-tomcat-01.domain.local,dev-tomcat-01.domain.local cmd.run "grep release /etc/issue 2>/dev/null" 2>/dev/null
Salt can use Boolean operators
salt --output=txt -C "[pdqu]l-tomcat* or [pdqu]l-weblogic*" cmd.run "logrotate -f /etc/logrotate.conf 2>/dev/null" 2>/dev/null
Targeting minions using “salt grains”
salt --timeout=30 --output=txt -G 'virtual:physical' cmd.run "uname -a"
salt --timeout=30 --output=txt -G 'manufacturer:HP' cmd.run "uname -a"
salt --timeout=30 --output=txt -G 'cpuarch:x86_64' cmd.run "uname -a"
salt --timeout=30 --output=txt -G 'os:RedHat' cmd.run "uname -a"
Target minions by subnet
salt -S 10.92.136.0/24 cmd.run "uptime"
Target minions by subnet and Salt grains
salt -C 'S@10.92.136.0/24 and G@os:RedHat' cmd.run "uptime"
Identify PROD servers with 15-min load average in double-digits:
salt --timeout=30 --output=txt "prod*" cmd.run "uptime | egrep -E '[0-9]{2}\.[0-9]{2}$'" 2>/dev/null
Get a list of NFS mounts on QA nodes
salt --timeout=30 --output=txt "qa*" cmd.run "grep ' nfs ' /etc/mtab 2>/dev/null" 2>/dev/null | awk -F',' '{print $1}' | awk '{print $1" "$2" "$3" "$5}' | column -t
Install Wireshark on minions that don’t already have it
salt --output=txt "qltc*" cmd.run "which tshark 2>/dev/null 1>&2 || yum -y install wireshark 2>/dev/null 1>&2" 2>/dev/null
Find “/opt” filesystem on “prod-db*” servers with the utilization of 80-100%
salt --output=txt "prod-db*" cmd.run "df -hlP /opt 2>/dev/null" 2>/dev/null | egrep "(([8-9][0-9])|100)\%" | awk '{print $1,$2,$5,$6}' | sort -u | column -t
Get allocated LVM size of minions in the “/root/list” containing short hostnames
salt --timeout=30 --output=txt -L "`sed -e 's/$/\.domain\.local/' /root/list`" cmd.run "vgs --units=k 2>/dev/null" 2>/dev/null | cut -d ' ' -f3- | awk '{print $6}' | grep -oE "[0-9]{1,100}\.[0-9]{2}" | awk '{ SUM += $1} END { print ( SUM/1024/1024 )" GB" }'
Get total memory size of minions in the “/root/list” containing short hostnames
salt --timeout=30 --output=txt -L "`sed -e 's/$/\.domain\.local/' /root/list`" cmd.run "free -k 2>/dev/null | grep ^Mem:" 2>/dev/null | cut -d ' ' -f3- | awk '{ SUM += $1} END { print ( SUM/1024/1024 )" GB" }'
Get the total number of CPU cores of minions in the “/root/list” containing short hostnames
salt --timeout=30 --output=txt -L "`sed -e 's/$/\.domain\.local/' /root/list`" cmd.run "cat /proc/cpuinfo" 2>/dev/null | cut -d ' ' -f2- | grep -c ^processor | awk '{ SUM += $1} END { print ( SUM )" cores" }'
Get a list of mounted NFS shares for minions in the “/root/list” containing short hostnames
salt --timeout=30 --output=txt -L "`sed -e 's/$/\.domain\.local/' /root/list`" cmd.run "cat /etc/mtab" 2>/dev/null | grep " nfs " | awk -F',' '{print $1}' | awk '{print $1" "$2" "$3" "$5}' | sed -e 's/:\s/\t/g' | column -t | sort -u
Display HP ILO IP address on all physical hosts using ipmitool
salt --output=txt -G "virtual:physical" cmd.run "ipmitool lan print 2>/dev/null | grep -E '^IP Address\s\s' | awk '{print $NF}'" 2>/dev/null
Show top ten QA nodes by CPU utilization for the “java” process running under the “weblogic” username
salt --output=txt "qa*" cmd.run "top -b -n 1 -d 1 -u weblogic 2>/dev/null | grep [j]ava" 2>/dev/null | column -t | sort -k10 -rn | head -10
Similar to the previous example, but sorted by memory utilization
salt --output=txt "qa*" cmd.run "top -b -n 1 -d 1 -u weblogic 2>/dev/null | grep [j]ava" 2>/dev/null | column -t | sort -k11 -rh | head -10
Copying Files and Folders
Copy a single file from the Salt master to minions
The source location must be inside the Salt master’s file root (i.e. /srv/salt/). Also, the target folder must already exist on the minions (in this example, /var/tmp). Keep in mind that, for security reasons, any file copied from the Salt master to a minion will have its permissions changed to 644; you will need to run another Salt command to set the desired permissions on the file.
salt "prod-weblogic*" cp.get_file salt://scripts/app_migrate.sh /var/tmp/app_migration.sh
Copy a directory with contents from master to minions
This example will copy the contents of /srv/salt/myfiles/ to /tmp/myfiles on the minions:
salt 'prod-weblogic*' cp.get_dir salt://myfiles /tmp
Copy contents of a directory from minions to the master
The data you copy will end up on the Salt master in /var/cache/salt/master/minions/<minion_id>/files/folder/on/minions
(for the example below), so be mindful of the available disk space.
salt "*" cp.push_dir /folder/on/minions/
Running Scripts via Salt
The scripts are usually located on the Salt master server in /srv/salt
(on RHEL). They don’t need to be executable.
Basic Syntax
Run yum_health_check.sh
script on all of the QA servers:
salt --timeout=30 --output=txt "qa*" cmd.script "salt://scripts/yum_health_check.sh"
Cleaning up output
Text output of cmd.script
is very busy and not very readable. I found it useful to create the following /usr/bin/clean
helper script that will sanitize the output of cmd.script
:
#!/bin/bash
# Sanitize the txt output of cmd.script: strip the u'stdout': u'...' wrappers
# and print each line of a multi-line result prefixed with its minion name
sed -r "s/: \{u'/^/g" | sed -r "s/u'stdout': u'/^/g" | sed -r "s/'}$/^/g" | \
sed -r "s/\\t/ /g" | sed -r "s/\\n/{/g" | grep -v ', ^^' | \
awk -F'^' '{print $1" "$3}' | while read line; do
  if [ $(echo "${line}" | grep -c '{') -gt 0 ]; then
    h="$(echo ${line} | awk '{print $1}')"
    echo "${line}" | \
    awk '{ s = ""; for (i = 2; i <= NF; i++) s = s $i " "; print s }' | \
    sed 's/{/\n/g' | while read line2; do echo "${h} ${line2}"; done
  else
    echo "${line}"
  fi
done | sort -k1V | column -t 2>/dev/null
And here’s how to use it with a script:
salt "qa*" cmd.script "salt://scripts/yum_health_check.sh" | clean
Passing arguments to a script
In this example, the app_find.sh script will try to locate an application called “webstore01” on various WebLogic servers:
salt --timeout=30 --output=txt "[pdqu]l*weblogic*" cmd.script "salt://scripts/app_find.sh" args="webstore01" | clean
Salt-Cloud Operations
This is a brief listing of the more common salt-cloud commands, with some VMware-specific examples. You can find all available salt-cloud commands in the official Salt documentation.
Working with VMware
The primary configuration file that allows salt-cloud to interact with VMware is usually located in /etc/salt/cloud.providers.d/vmware.conf and looks something like this:
vcenter01:
  driver: vmware
  user: 'DOMAIN\vCenterSvcAccount'
  password: 'P@ssw0rd1'
  url: 'vcenter01.domain.local'
  protocol: 'https'
  port: 443
  esxi_host_user: 'root'
  esxi_host_password: 'P@ssw0rd2'

vcenter02:
  driver: vmware
  user: 'DOMAIN\vCenterSvcAccount'
  password: 'P@ssw0rd1'
  url: 'vcenter02.domain.local'
  protocol: 'https'
  port: 443
  esxi_host_user: 'root'
  esxi_host_password: 'P@ssw0rd2'
There is one important detail to keep in mind: with VMware, salt-cloud
has two modes of operation – vmware
and vsphere
. The two are similar for the most part. However, in the vsphere
mode you need to specify the name of the vCenter. In the vmware
mode, the desired operation will be performed on all configured vCenters.
Let’s say you want to create a snapshot of a VM called prod-tomcat-test-01
. The following command will do this for you:
salt-cloud -a create_snapshot prod-tomcat-test-01 snapshot_name="prod-tomcat-test-01_$(date +'%F')" description="Test snapshot made by $(whoami)@$(hostname) on $(date +'%F')"
Notice how I did not specify the name of the vCenter. Salt already knows which vCenter has this VM. However, if more than one vCenter has a VM with that name, snapshots will be created for all VMs matching the name.
List configured cloud providers
salt-cloud --out=json --list-providers
List clusters in a vCenter
salt-cloud --out=compact -f list_clusters
List VM build profiles
salt-cloud --out=nested --list-profiles
List VMs in a vCenter
salt-cloud --out=nested --query
List all VM snapshots in a vCenter
salt-cloud --out=json -f list_snapshots
Use jq
to parse output
Similar to the previous example, but we extract only the names of VMs that have snapshots.
salt-cloud --out=json -f list_snapshots | jq -r '.[]|.[]|keys' | grep -oP "(?<=\").*(?=\",)"
List snapshots for a particular VM
salt-cloud --out=json -f list_snapshots name=""
Create a snapshot
Note: by default, salt-cloud will prompt you for confirmation before modifying a VM’s configuration state, a snapshot, an ESX host, a cluster, or a vCenter. You can use – carefully – the --assume-yes or -y option to assume an affirmative response and proceed with the operation.
salt-cloud --out=json -y -a create_snapshot prod-tomcat-test-01 snapshot_name="before_patching_$(date +'%F')" description="Test snapshot made by $(whoami)@$(hostname) on $(date +'%F')"
Revert to a snapshot
salt-cloud --out=json -y -a revert_to_snapshot snapshot_name=""
Merge all snapshots for a VM
salt-cloud --out=json -y -a remove_all_snapshots prod-tomcat-test-01 merge_snapshots=True