Benford’s Law, a curious mathematical phenomenon, asserts that in many real-world numerical datasets, the leading significant digit is more likely to be small. This observation, far from being merely intriguing, can be a powerful tool for distinguishing authentic data from manipulated or fabricated ones.

Consider the naive expectation: given nine possible leading digits, one might anticipate a uniform distribution, with each digit occurring approximately 11.1% of the time. However, Benford’s Law reveals a counterintuitive reality: the digit 1 appears as the leading digit in roughly 30% of cases, while 9 occurs less than 5% of the time. This skew is not a quirk of any one dataset: it shows up most reliably in data that spans several orders of magnitude.
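
As a quick sanity check on those percentages, the expected share of a leading digit d under Benford’s Law is log10(1 + 1/d). Here is a minimal awk sketch that evaluates it for each digit. The function name benford_expected is my own and is not part of the scripts below.

```bash
#!/bin/bash
# Print the Benford expected percentage for each leading digit 1..9.
# awk's log() is the natural logarithm, so divide by log(10) to get log10.
benford_expected() {
    awk 'BEGIN {
        for (d = 1; d <= 9; d++)
            printf "%d\t%.1f\n", d, 100 * log(1 + 1 / d) / log(10)
    }'
}

benford_expected
```

The values it prints should match the hard-coded frequency table used in the validator script.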

Intrigued by this discrepancy, I set out to write a script that analyzes a given dataset and measures how closely it adheres to Benford’s Law. Such a tool could be invaluable for detecting fraudulent or manipulated data.

Then I thought it might also be fun to write a second script that does the reverse: generate a bunch of numbers that look like they follow Benford’s Law.

The first script – benfords_law_validator.sh – takes a file with one number per line and reports whether the numbers follow Benford’s Law and how closely. It also draws a graph that compares your data to the ideal expected distribution. Here’s a quick example:

The script
#!/bin/bash

# Check if a file is provided as an argument
if [ $# -ne 1 ]; then
    echo "Usage: $0 <file_with_numbers>"
    exit 1
fi

file="$1"

# Check if the file exists
if [ ! -f "${file}" ] || [ ! -r "${file}" ]; then
    echo "File not found!"
    exit 1
fi

# Extract the leading digits, count their occurrences, and sort
declare -A digit_counts
total_numbers=0

# Read through the file line by line
while read -r number; do
    # Extract the first non-zero digit
    first_digit=$(echo "$number" | grep -o '[1-9]' | head -n 1)
    if [ -n "$first_digit" ]; then
        digit_counts[$first_digit]=$((digit_counts[$first_digit] + 1))
        total_numbers=$((total_numbers + 1))
    fi
done < "$file"

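# Bail out early if no leading digits were found (avoids division by zero below)
if [ "$total_numbers" -eq 0 ]; then
    echo "No numeric data found in ${file}."
    exit 1
fi

# Expected frequencies for Benford's Law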
declare -A benford_frequencies=( ["1"]=30.1 ["2"]=17.6 ["3"]=12.5 ["4"]=9.7 ["5"]=7.9 ["6"]=6.7 ["7"]=5.8 ["8"]=5.1 ["9"]=4.6 )

# Output the observed frequencies, expected frequencies, and compute the Chi-Square statistic
echo -e "Digit\tObserved(%)\tExpected(%)\tChi-Square"

chi_square=0

# Temporary file for gnuplot data
plot_data=$(mktemp)

for digit in {1..9}; do
    observed_count=${digit_counts[$digit]:-0}
    observed_percentage=$(awk "BEGIN { printf \"%.2f\", ($observed_count / $total_numbers) * 100 }")
    expected_percentage=${benford_frequencies[$digit]}
    
    # Calculate the expected count based on Benford's Law
    expected_count=$(awk "BEGIN { printf \"%.2f\", ($expected_percentage / 100) * $total_numbers }")
    
    # Calculate the Chi-Square component for this digit
    if (( $(echo "$expected_count > 0" | bc -l) )); then
        chi_square_component=$(awk "BEGIN { printf \"%.4f\", (($observed_count - $expected_count) ^ 2) / $expected_count }")
        chi_square=$(awk "BEGIN { printf \"%.4f\", $chi_square + $chi_square_component }")
    else
        chi_square_component="N/A"
    fi
    
    echo -e "$digit\t$observed_percentage\t\t$expected_percentage\t\t$chi_square_component"
    
    # Store data for gnuplot (digit observed expected)
    echo -e "$digit $observed_percentage $expected_percentage" >> "$plot_data"
done

# Output the total Chi-Square statistic
echo -e "\nChi-Square Statistic: $chi_square"

# Provide a basic interpretation (degrees of freedom = 9 - 1 = 8 for Benford's Law test)
critical_value=15.51 # Chi-square critical value at 8 degrees of freedom, 0.05 significance level
if (( $(echo "$chi_square < $critical_value" | bc -l) )); then
    echo "The dataset follows Benford's Law (Chi-Square <$critical_value)."
else
    echo "The dataset does NOT follow Benford's Law (Chi-Square >=$critical_value)."
fi

# Gnuplot command to generate a dumb terminal graph
gnuplot -persist <<-EOF
    set terminal dumb size 120,40
    set title "Benford's Law Distribution"
    set key left top
    set xlabel "Digit"
    set ylabel "Percentage"
    set yrange [0:*]
    set style data histograms
    set style histogram clustered gap 1
    set style fill solid 0.5 border -1
    set boxwidth 0.9
    plot "$plot_data" using 2:xtic(1) title "Observed" linecolor rgb "blue", \
         '' using 3 title "Expected" linecolor rgb "red"
EOF

# Clean up the temporary file
/bin/rm "$plot_data"

./benfords_law_validator.sh $f

Digit   Observed(%)     Expected(%)     Chi-Square
1       28.00           30.1            0.7326
2       16.00           17.6            0.7273
3       14.20           12.5            1.1560
4       10.00           9.7             0.0464
5       9.60            7.9             1.8291
6       7.00            6.7             0.0672
7       5.80            5.8             0.0000
8       5.40            5.1             0.0882
9       4.00            4.6             0.3913

Chi-Square Statistic: 5.0381
The dataset follows Benford's Law (Chi-Square <15.51).

                      Benford's Law Distribution

  35 +------------------------------------------------------------+
     |     +     +     +     +      +     +     +     +     +     |
  30 |Observed                                                  +-|
     |Expected                                                    |
  25 |-+ * # #                                                  +-|
     |   * # #                                                    |
  20 |-+ * # #   ###                                            +-|
  15 |-+ * # # **# #                                            +-|
     |   * # # * # # **###                                        |
  10 |-+ * # # * # # * # # ***##  **                            +-|
     |   * # # * # # * # # * *##  **##  **                        |
   5 |-+ * # # * # # * # # * *##  **##  **### **### **###   ### +-|
     |   * # # * # # * # # * *##  **##  **# # * # # * # # **# #   |
   0 +------------------------------------------------------------+
           1     2     3     4      5     6     7     8     9
                                 Digit

The script uses the chi-squared test, which works better with larger datasets (a common rule of thumb is an expected count of at least five per digit). When I have more time, I’ll make a version that uses the Kolmogorov–Smirnov or the Kuiper test, which is better suited to smaller sets of data.
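
In the meantime, here is a rough sketch of what a Kolmogorov–Smirnov-style check on leading digits could look like. It only computes the statistic D, the largest gap between the observed and the Benford cumulative distributions. The function name ks_leading_digit is hypothetical, and this is not the version promised above.

```bash
#!/bin/bash
# Sketch: Kolmogorov-Smirnov-style statistic for leading digits.
# Reads one number per line from the file given as $1 and prints
# "<N> <D>", where N is the count of usable lines and D is the largest
# gap between the observed and the Benford cumulative distributions.
ks_leading_digit() {
    awk '
    match($0, /[1-9]/) {               # first non-zero digit on the line
        count[substr($0, RSTART, 1)]++
        n++
    }
    END {
        if (n == 0) { print "0 0.0000"; exit }
        for (d = 1; d <= 9; d++) {
            obs += count[d] / n                   # observed CDF at d
            expct += log(1 + 1 / d) / log(10)     # Benford CDF at d
            gap = obs - expct
            if (gap < 0) gap = -gap
            if (gap > dmax) dmax = gap
        }
        printf "%d %.4f\n", n, dmax
    }' "$1"
}
```

Run it as ks_leading_digit numbers.txt. For the continuous K-S test, a commonly quoted large-sample 5% critical value is about 1.36/sqrt(N). With only nine discrete bins that threshold is conservative, so treat it as a rough guide rather than a verdict.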

The second script – benfords_law_generator_mk2.sh – accepts four parameters: minimum and maximum values, the number of decimal places (just for added realism), and the number of lines to generate.

Random sampling does not guarantee that any single batch of numbers will pass the chi-squared test, so the script runs in a loop, generating and testing data sets until a good fit is found. Here’s a simplified example:

The script
#!/bin/bash

# Check that the correct number of arguments is provided
if [ $# -ne 4 ]; then
    echo "Usage: $0 <low_end> <top_end> <decimal_places> <num_output>"
    exit 1
fi

# Parse input arguments
low_end=$1
top_end=$2
decimal_places=$3
num_output=$4

# Ensure low_end and top_end are positive numbers
if ! [[ $low_end =~ ^[0-9]+(\.[0-9]+)?$ ]] || ! [[ $top_end =~ ^[0-9]+(\.[0-9]+)?$ ]] || (( $(echo "$low_end <= 0" | bc -l) )) || (( $(echo "$top_end <= 0" | bc -l) )); then
    echo "Error: low_end and top_end must be positive numbers."
    exit 1
fi

# Ensure decimal_places is a non-negative integer and num_output is a positive integer
if ! [[ $decimal_places =~ ^[0-9]+$ ]] || ! [[ $num_output =~ ^[0-9]+$ ]] || [ "$num_output" -lt 1 ]; then
    echo "Error: decimal_places must be a non-negative integer and num_output a positive integer."
    exit 1
fi

# Temporary file to store generated numbers
temp_file=$(mktemp)

# Calculate logarithmic bounds
min_log=$(echo "l($low_end)" | bc -l 2>/dev/null)
max_log=$(echo "l($top_end)" | bc -l 2>/dev/null)

export min_log max_log decimal_places

# Ensure logarithmic bounds are computed successfully
if [ -z "$min_log" ] || [ -z "$max_log" ]; then
    echo "Error: Failed to compute logarithmic bounds. Check input values."
    exit 1
fi

# Function to generate a Benford-distributed number
generate_benford_number() {
    local min_log="$1"
    local max_log="$2"
    local decimal_places="$3"
    local top_end="$4"
    local low_end="$5"

    # Generate a random logarithmic value that follows Benford's Law
    rand_fraction=$(echo "scale=10; $RANDOM / 32767" | bc -l)
    rand_log=$(echo "scale=10; $min_log + ($max_log - $min_log) * $rand_fraction" | bc -l)

    # Convert the logarithmic value back to a number
    benford_number=$(echo "scale=10; e($rand_log)" | bc -l)

    # Ensure the number stays within bounds (this should rarely trigger)
    if (( $(echo "$benford_number < $low_end" | bc -l) )); then
        benford_number=$low_end
    elif (( $(echo "$benford_number > $top_end" | bc -l) )); then
        benford_number=$top_end
    fi

    # Format the number to the specified number of decimal places
    printf "%.*f\n" "$decimal_places" "$benford_number"
}

export -f generate_benford_number

# Function to generate numbers
generate_numbers() {
    # Use parallel processing to speed up the number generation, and save output to the temp file
    seq 1 "$num_output" | xargs -n 1 -P "$(nproc)" bash -c 'generate_benford_number "$@"' _ "$min_log" "$max_log" "$decimal_places" "$top_end" "$low_end" > "$temp_file"
}

# Function to validate numbers against Benford's Law
validate_numbers() {
    # Reset digit counts
    declare -A digit_counts
    total_numbers=0

    # Read through the file line by line
    while read -r number; do
        # Extract the first non-zero digit
        first_digit=$(echo "$number" | grep -o '[1-9]' | head -n 1)
        if [ -n "$first_digit" ]; then
            digit_counts[$first_digit]=$((digit_counts[$first_digit] + 1))
            total_numbers=$((total_numbers + 1))
        fi
    done < "$temp_file"

    # Expected frequencies for Benford's Law
    declare -A benford_frequencies=( ["1"]=30.1 ["2"]=17.6 ["3"]=12.5 ["4"]=9.7 ["5"]=7.9 ["6"]=6.7 ["7"]=5.8 ["8"]=5.1 ["9"]=4.6 )

    # Compute the Chi-Square statistic
    chi_square=0
    for digit in {1..9}; do
        observed_count=${digit_counts[$digit]:-0}
        expected_percentage=${benford_frequencies[$digit]}
        
        # Calculate the expected count based on Benford's Law
        expected_count=$(awk "BEGIN { printf \"%.2f\", ($expected_percentage / 100) * $total_numbers }")
        
        # Calculate the Chi-Square component for this digit
        if (( $(echo "$expected_count > 0" | bc -l) )); then
            chi_square_component=$(awk "BEGIN { printf \"%.4f\", (($observed_count - $expected_count) ^ 2) / $expected_count }")
            chi_square=$(awk "BEGIN { printf \"%.4f\", $chi_square + $chi_square_component }")
        fi
    done

    # Check if the dataset follows Benford's Law
    critical_value=15.51 # Chi-square critical value at 8 degrees of freedom, 0.05 significance level
    if (( $(echo "$chi_square < $critical_value" | bc -l) )); then
        return 0 # Success
    else
        return 1 # Failure
    fi
}

# Loop until numbers follow Benford's Law
while true; do
    generate_numbers   # Generate numbers and store in $temp_file
    if validate_numbers; then  # Validate the numbers from $temp_file
        cat "$temp_file"  # Output the valid numbers
        break
    fi
done

# Clean up
/bin/rm "$temp_file"

./benfords_law_generator_mk2.sh 487 53210 2 10

5229.83
3845.22
2741.83
9649.43
624.40
7059.23
18305.62
1020.44
535.29
34400.21
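
The core trick in the generator is to pick a value uniformly between the logarithms of the two bounds and then exponentiate it. As a side note, that step can be sketched in a single awk call instead of several bc calls per number. The function name log_uniform and the optional seed argument are my own additions for illustration, and the sketch leaves out the chi-squared retry loop.

```bash
#!/bin/bash
# Sketch: draw log-uniform samples between two positive bounds in one awk call.
# Usage: log_uniform <low_end> <top_end> <decimal_places> <count> [seed]
log_uniform() {
    awk -v lo="$1" -v hi="$2" -v dp="$3" -v n="$4" -v seed="${5:-}" 'BEGIN {
        # Seed from the optional argument, otherwise from the clock
        if (seed != "") srand(seed); else srand()
        min_log = log(lo); max_log = log(hi)
        fmt = "%." dp "f\n"            # e.g. "%.2f\n" for two decimal places
        for (i = 0; i < n; i++) {
            # Uniform in log space, then exponentiate back
            x = exp(min_log + (max_log - min_log) * rand())
            printf fmt, x
        }
    }'
}
```

For example, log_uniform 487 53210 2 10 prints ten values in the same range as the run above. Passing a fifth argument fixes the seed, which makes a run repeatable within the same awk implementation.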

Election fraud, tax evasion – the possibilities are limitless. Just keep in mind that I am not a mathematician and I won’t be visiting you in jail.