Benford’s Law, a curious mathematical phenomenon, asserts that in many real-world numerical datasets, the leading significant digit is more likely to be small. This observation, far from being merely intriguing, can be a powerful tool for distinguishing authentic data from manipulated or fabricated data.
Consider the naive expectation: given nine possible leading digits, one might anticipate a uniform distribution, with each digit occurring approximately 11.1% of the time. However, Benford’s Law reveals a counterintuitive reality: the digit 1 appears as the leading digit in roughly 30% of cases, while 9 occurs less than 5% of the time. This anomaly suggests a fundamental property inherent to many real-world phenomena.
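Those percentages come from the standard statement of Benford’s Law, under which a leading digit d occurs with probability log10(1 + 1/d). Here is a minimal sketch that prints the expected percentage for each digit, assuming only bash and awk are available. The function name benford_expected_pct is made up for this illustration.

```shell
#!/bin/bash
# Hypothetical helper (not one of the scripts in this post):
# print the expected Benford percentage for a leading digit 1-9.
benford_expected_pct() {
    local digit="$1"
    # P(d) = log10(1 + 1/d); awk only has the natural log, so divide by log(10)
    awk -v d="$digit" 'BEGIN { printf "%.2f\n", 100 * log(1 + 1 / d) / log(10) }'
}

# Print one "digit<TAB>percentage" line per leading digit
for digit in {1..9}; do
    printf '%s\t%s\n' "$digit" "$(benford_expected_pct "$digit")"
done
```

Rounded to one decimal place, these values run from 30.1 for the digit 1 down to 4.6 for the digit 9.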
Intrigued by this discrepancy, I set out to write a script that analyzes a dataset and measures how closely it adheres to Benford’s Law. Such a tool could be useful for flagging fraudulent or manipulated data. I also thought it might be fun to write a second script that does the opposite and generates a batch of numbers that look like they follow Benford’s Law.
The first script – benfords_law_validator.sh – looks at a list of numbers you provide and decides if they follow Benford’s law and how well. The script also generates a graph that compares your data to the ideal expected distribution. Here’s a quick example:
The script
#!/bin/bash
# Check if a file is provided as an argument
if [ $# -ne 1 ]; then
    echo "Usage: $0 <file_with_numbers>"
    exit 1
fi
file="$1"
# Check if the file exists and is readable
if [ ! -f "${file}" ] || [ ! -r "${file}" ]; then
    echo "File not found or not readable!"
    exit 1
fi
# Extract the leading digits and count their occurrences
declare -A digit_counts
total_numbers=0
# Read through the file line by line
while read -r number; do
    # Extract the first non-zero digit
    first_digit=$(echo "$number" | grep -o '[1-9]' | head -n 1)
    if [ -n "$first_digit" ]; then
        digit_counts[$first_digit]=$((digit_counts[$first_digit] + 1))
        total_numbers=$((total_numbers + 1))
    fi
done < "$file"
# Stop early if nothing was counted, to avoid dividing by zero below
if [ "$total_numbers" -eq 0 ]; then
    echo "No usable numbers found in the input file."
    exit 1
fi
# Expected frequencies for Benford's Law
declare -A benford_frequencies=( ["1"]=30.1 ["2"]=17.6 ["3"]=12.5 ["4"]=9.7 ["5"]=7.9 ["6"]=6.7 ["7"]=5.8 ["8"]=5.1 ["9"]=4.6 )
# Output the observed frequencies, expected frequencies, and compute the Chi-Square statistic
echo -e "Digit\tObserved(%)\tExpected(%)\tChi-Square"
chi_square=0
# Temporary file for gnuplot data
plot_data=$(mktemp)
for digit in {1..9}; do
    observed_count=${digit_counts[$digit]:-0}
    observed_percentage=$(awk "BEGIN { printf \"%.2f\", ($observed_count / $total_numbers) * 100 }")
    expected_percentage=${benford_frequencies[$digit]}
    # Calculate the expected count based on Benford's Law
    expected_count=$(awk "BEGIN { printf \"%.2f\", ($expected_percentage / 100) * $total_numbers }")
    # Calculate the Chi-Square component for this digit
    if (( $(echo "$expected_count > 0" | bc -l) )); then
        chi_square_component=$(awk "BEGIN { printf \"%.4f\", (($observed_count - $expected_count) ^ 2) / $expected_count }")
        chi_square=$(awk "BEGIN { printf \"%.4f\", $chi_square + $chi_square_component }")
    else
        chi_square_component="N/A"
    fi
    echo -e "$digit\t$observed_percentage\t\t$expected_percentage\t\t$chi_square_component"
    # Store data for gnuplot (digit observed expected)
    echo -e "$digit $observed_percentage $expected_percentage" >> "$plot_data"
done
# Output the total Chi-Square statistic
echo -e "\nChi-Square Statistic: $chi_square"
# Provide a basic interpretation (degrees of freedom = 9 - 1 = 8 for Benford's Law test)
critical_value=15.51 # Chi-square critical value at 8 degrees of freedom, 0.05 significance level
if (( $(echo "$chi_square < $critical_value" | bc -l) )); then
    echo "The dataset follows Benford's Law (Chi-Square <$critical_value)."
else
    echo "The dataset does NOT follow Benford's Law (Chi-Square >=$critical_value)."
fi
# Gnuplot command to generate a dumb terminal graph
gnuplot -persist <<-EOF
set terminal dumb size 120,40
set title "Benford's Law Distribution"
set key left top
set xlabel "Digit"
set ylabel "Percentage"
set yrange [0:*]
set style data histograms
set style histogram clustered gap 1
set style fill solid 0.5 border -1
set boxwidth 0.9
plot "$plot_data" using 2:xtic(1) title "Observed" linecolor rgb "blue", \
'' using 3 title "Expected" linecolor rgb "red"
EOF
# Clean up the temporary file
/bin/rm "$plot_data"
./benfords_law_validator.sh "$f"
Digit   Observed(%)     Expected(%)     Chi-Square
1       28.00           30.1            0.7326
2       16.00           17.6            0.7273
3       14.20           12.5            1.1560
4       10.00           9.7             0.0464
5       9.60            7.9             1.8291
6       7.00            6.7             0.0672
7       5.80            5.8             0.0000
8       5.40            5.1             0.0882
9       4.00            4.6             0.3913

Chi-Square Statistic: 5.0381
The dataset follows Benford's Law (Chi-Square <15.51).
[Terminal histogram drawn by gnuplot: "Benford's Law Distribution", showing Observed and Expected percentage (y-axis, 0 to 35) for each leading digit 1 through 9 (x-axis).]
The script uses the Chi-squared test, which works better with larger datasets. When I have more time, I’ll make a version that uses the Kolmogorov–Smirnov or the Kuiper test, which are better suited to smaller sets of data.
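To make the Chi-Square column in the sample output less of a black box, here is a small sketch that recomputes the contribution of a single digit. The function name chi_square_component is hypothetical, and the sample size of 500 is my inference from the percentages shown, not something the output states.

```shell
#!/bin/bash
# Hypothetical helper: Chi-Square contribution of one digit.
# Arguments: observed count, expected percentage, total number of values.
chi_square_component() {
    awk -v o="$1" -v pct="$2" -v n="$3" 'BEGIN {
        expected = pct / 100 * n    # expected count under Benford
        printf "%.4f\n", (o - expected) ^ 2 / expected
    }'
}

# Assumed sample size of 500, which is consistent with the output above
chi_square_component 140 30.1 500   # digit 1: 28.00% observed
chi_square_component 80 17.6 500    # digit 2: 16.00% observed
```

Under that assumption the two calls print 0.7326 and 0.7273, the same values reported for digits 1 and 2. Summing all nine components gives the total statistic that is compared against the critical value.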
The second script, benfords_law_generator_mk2.sh, accepts four parameters: minimum and maximum values, the number of decimal places (just for added realism), and the number of lines to generate.
A randomly generated sequence is not guaranteed to pass the Chi-squared test, so the script runs in a loop, regenerating and testing the dataset until it finds a good fit. Here’s a simplified example:
The script
#!/bin/bash
# Parse input arguments
low_end=$1
top_end=$2
decimal_places=$3
num_output=$4
# Check if the correct number of arguments is provided
if [ $# -ne 4 ]; then
    echo "Usage: $0 <low_end> <top_end> <decimal_places> <num_output>"
    exit 1
fi
# Ensure low_end and top_end are positive numbers
if ! [[ $low_end =~ ^[0-9]+(\.[0-9]+)?$ ]] || ! [[ $top_end =~ ^[0-9]+(\.[0-9]+)?$ ]] || (( $(echo "$low_end <= 0" | bc -l) )) || (( $(echo "$top_end <= 0" | bc -l) )); then
    echo "Error: low_end and top_end must be positive numbers."
    exit 1
fi
# Ensure decimal_places and num_output are non-negative integers
if ! [[ $decimal_places =~ ^[0-9]+$ ]] || ! [[ $num_output =~ ^[0-9]+$ ]]; then
    echo "Error: decimal_places and num_output must be non-negative integers."
    exit 1
fi
# Temporary file to store generated numbers
temp_file=$(mktemp)
# Calculate logarithmic bounds
min_log=$(echo "l($low_end)" | bc -l 2>/dev/null)
max_log=$(echo "l($top_end)" | bc -l 2>/dev/null)
export min_log max_log decimal_places
# Ensure logarithmic bounds are computed successfully
if [ -z "$min_log" ] || [ -z "$max_log" ]; then
    echo "Error: Failed to compute logarithmic bounds. Check input values."
    exit 1
fi
# Function to generate a Benford-distributed number
generate_benford_number() {
    local min_log="$1"
    local max_log="$2"
    local decimal_places="$3"
    local top_end="$4"
    local low_end="$5"
    # Pick a value uniformly between the logarithmic bounds; numbers that are
    # uniform on a log scale have leading digits that approximate Benford's Law
    rand_fraction=$(echo "scale=10; $RANDOM / 32767" | bc -l)
    rand_log=$(echo "scale=10; $min_log + ($max_log - $min_log) * $rand_fraction" | bc -l)
    # Convert the logarithmic value back to a number
    benford_number=$(echo "scale=10; e($rand_log)" | bc -l)
    # Clamp to the bounds in case rounding pushed the number outside (this should rarely trigger)
    if (( $(echo "$benford_number < $low_end" | bc -l) )); then
        benford_number=$low_end
    elif (( $(echo "$benford_number > $top_end" | bc -l) )); then
        benford_number=$top_end
    fi
    # Format the number to the specified number of decimal places
    printf "%.*f\n" "$decimal_places" "$benford_number"
}
export -f generate_benford_number
# Function to generate numbers
generate_numbers() {
    # Use parallel processing to speed up the number generation, and save output to the temp file
    seq 1 "$num_output" | xargs -n 1 -P "$(nproc)" bash -c 'generate_benford_number "$@"' _ "$min_log" "$max_log" "$decimal_places" "$top_end" "$low_end" > "$temp_file"
}
# Function to validate numbers against Benford's Law
validate_numbers() {
    # Reset digit counts
    declare -A digit_counts
    total_numbers=0
    # Read through the file line by line
    while read -r number; do
        # Extract the first non-zero digit
        first_digit=$(echo "$number" | grep -o '[1-9]' | head -n 1)
        if [ -n "$first_digit" ]; then
            digit_counts[$first_digit]=$((digit_counts[$first_digit] + 1))
            total_numbers=$((total_numbers + 1))
        fi
    done < "$temp_file"
    # Expected frequencies for Benford's Law
    declare -A benford_frequencies=( ["1"]=30.1 ["2"]=17.6 ["3"]=12.5 ["4"]=9.7 ["5"]=7.9 ["6"]=6.7 ["7"]=5.8 ["8"]=5.1 ["9"]=4.6 )
    # Compute the Chi-Square statistic
    chi_square=0
    for digit in {1..9}; do
        observed_count=${digit_counts[$digit]:-0}
        expected_percentage=${benford_frequencies[$digit]}
        # Calculate the expected count based on Benford's Law
        expected_count=$(awk "BEGIN { printf \"%.2f\", ($expected_percentage / 100) * $total_numbers }")
        # Calculate the Chi-Square component for this digit
        if (( $(echo "$expected_count > 0" | bc -l) )); then
            chi_square_component=$(awk "BEGIN { printf \"%.4f\", (($observed_count - $expected_count) ^ 2) / $expected_count }")
            chi_square=$(awk "BEGIN { printf \"%.4f\", $chi_square + $chi_square_component }")
        fi
    done
    # Check if the dataset follows Benford's Law
    critical_value=15.51 # Chi-square critical value at 8 degrees of freedom, 0.05 significance level
    if (( $(echo "$chi_square < $critical_value" | bc -l) )); then
        return 0 # Success
    else
        return 1 # Failure
    fi
}
# Loop until numbers follow Benford's Law
while true; do
    generate_numbers # Generate numbers and store in $temp_file
    if validate_numbers; then # Validate the numbers from $temp_file
        cat "$temp_file" # Output the valid numbers
        break
    fi
done
# Clean up
/bin/rm "$temp_file"
./benfords_law_generator_mk2.sh 487 53210 2 10
5229.83
3845.22
2741.83
9649.43
624.40
7059.23
18305.62
1020.44
535.29
34400.21
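The heart of the generator is picking a value uniformly on a logarithmic scale and exponentiating it. That produces leading digits that match Benford’s distribution exactly when the range covers a whole number of decades and approximately otherwise. Here is a compact sketch of just that step in a single awk call, without the validation loop. The function name log_uniform_sample and its optional seed argument are hypothetical and not part of benfords_law_generator_mk2.sh.

```shell
#!/bin/bash
# Hypothetical helper: print COUNT log-uniform numbers between LOW and HIGH.
# Arguments: low, high, decimal places, count, optional seed (default 1).
log_uniform_sample() {
    local low="$1" high="$2" places="$3" count="$4" seed="${5:-1}"
    awk -v lo="$low" -v hi="$high" -v p="$places" -v n="$count" -v seed="$seed" '
        BEGIN {
            srand(seed)
            fmt = "%." p "f\n"    # build the output format, e.g. "%.2f\n"
            for (i = 0; i < n; i++) {
                # Uniform in log space, then exponentiate back
                x = exp(log(lo) + (log(hi) - log(lo)) * rand())
                printf fmt, x
            }
        }'
}

# Same range and precision as the example run above, five values
log_uniform_sample 487 53210 2 5
```

Doing the sampling inside one awk process avoids starting a new shell and several bc calls per number, at the cost of a fixed seed producing the same sequence every run.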
Election fraud, tax evasion – the possibilities are limitless. Just keep in mind that I am not a mathematician and I won’t be visiting you in jail.