Originally published August 9, 2017 @ 9:00 pm
While I am on the subject of selecting time ranges from logs, a practical application presented itself. A lonely server in a far-away land with a habit of running out of memory could only be monitored via stunnel
and reverse SSH. The server crashed again today and, after bringing it back to life, I looked at whatever meager data the SSH monitor collected to see if there was anything useful.
The SSH monitoring script is simple: every fifteen minutes a connection is made to the server via reverse SSH tunnel and the uptime
command is executed. If connection cannot be established in a couple of minutes, an alert is dispatched. So the log file looks something like this:
12:00:08 up 2 days, 3:38, 1 user, load average: 2.81, 1.80, 2.80 12:15:09 up 2 days, 3:53, 1 user, load average: 1.15, 2.15, 2.78 12:30:10 up 2 days, 4:08, 1 user, load average: 6.30, 2.44, 2.42 12:45:07 up 2 days, 4:23, 1 user, load average: 2.38, 6.80, 5.01 13:00:06 up 2 days, 4:38, 1 user, load average: 2.57, 1.70, 2.84
I thought I saw a pattern in the system load numbers, but then it could’ve been just lack of sleep. Basically, I needed to plot the 15-minute average load and I had no fancy tools like Splunk. There are two problems: the log has no day – only hours and minutes – and the timestamp is in a timezone different from mine. So the solution was to select just those lines covering just the right number of 15-minute intervals since last midnight in the log’s local timezone.
(( d = $(date -d "1970-01-01 $(date +'%H:%MSGT')" +%s) / ( 15 * 60 ) )) tail -${d} /var/log/ssh_monitor.log | gnuplot -e "set terminal dumb 84 28; unset key; set style data labels; set xdata time; set xlabel 'Time'; set ylabel 'Load'; set autoscale; set timefmt '%H:%M'; plot '-' using 5:12:ytic(12) with lines"
And the result:
Load 6.77 ++*++-++-++-++-++-+++-++-++-++-++-++-+++-++-++-++-++-+++-++-++-++-++-++-++ + * + + + + + + + + + + | * | 6.09 ++ * * ++ 5.90 ++ * * * ++ | * * ** | 5.46 ++ * * * ** ++ | * * * * * * | | * ** * * * * | | * * * * * * * | | * * * * * * * | | * * * * * * * | | * * * * * * * | | * * * * * * * | | * * * * * * * | | * * * * * * * | 3.26 ++ * * * * * *** * * ++ 3.02 ++ *** * * * * ** * * * ++ 2.96 ++ ***** * *** * * * * *** ++ 2.65 ++ *** ** * * * * * * * * ++ 2.52 ++ * ** * * * * ** * ++ 2.20 ++ ** ** ** ** * ++ 1.94 ++ + + + ** + + * + + *+ + * ++ 1.83 ++-++-++-++-++-++-+++-++-++*++-++-++-+++-++-++-++-++-+++-++-++-++-++-++*++ 09:30:000:00:0010:30:001:00:001:30:0012:00:002:30:003:00:003:30:0014:00:00:30:00 Time
So I wasn’t imagining things: there was a pattern.
And now a bit of explanation. The following command tells you how many seconds have elapsed since midnight in the specified timezone (SGT
, in this case):
date -d "1970-01-01 $(date +'%H:%MSGT')" +%s
Then it’s a simple matter of dividing it into 15-minute intervals and then grabbing that many lines from the end of the log file. And the gnuplot
bit is pretty standard.
Experienced Unix/Linux System Administrator with 20-year background in Systems Analysis, Problem Resolution and Engineering Application Support in a large distributed Unix and Windows server environment. Strong problem determination skills. Good knowledge of networking, remote diagnostic techniques, firewalls and network security. Extensive experience with engineering application and database servers, high-availability systems, high-performance computing clusters, and process automation.