The issue of overused words is a tricky one. For me, the problem usually surfaces when I revisit my writing a few days later to edit it: I find a word I would like to replace with a better match, forgetting that the same word already appears a couple of times elsewhere in the text.

I’ve been using Grammarly for some time now, and lately I’ve also tried out TypeAI. Both can help you with syntax, punctuation, and overall style. However, neither tool does a good job of keeping track of overused words.

If you use ChatGPT or another LLM-based platform to help with your writing or editing, the problem of overused words becomes even more pronounced. Word- and phrase-frequency analysis is one of the easiest ways to spot AI-generated content. While it may be “important to consider” things and “delve into the intricacies” of them, overusing certain words and phrases may also reveal your dirty little AI secret.
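
A quick way to eyeball word frequencies before reaching for the full script is a classic shell one-liner. A minimal sketch (draft.txt is a stand-in for your own file):

tr -cs '[:alpha:]' '\n' < draft.txt |  # split into one word per line
    tr '[:upper:]' '[:lower:]' |       # normalize case
    sort | uniq -c | sort -rn |        # count and rank by frequency
    head -20                           # show the top 20 words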

The Bash script below (also available in my GitHub repo) analyzes a text file, flags overused words, and suggests synonyms to help you vary your vocabulary. It uses a combination of shell commands and an embedded Python script to accomplish its tasks.

Script Breakdown:

  1. Input Validation:
    • The script starts by checking whether an input file ($infile) was provided. If not, it prints the correct usage and exits.
    • It then checks that the file exists. If the file is not found, it prints an error message and exits.
  2. Download Common Words List:
    • A temporary file is created to store a list of common words downloaded from a specified URL using curl. If the download fails or the file is empty, the script reports an error and exits.
  3. Combine Word Lists:
    • If a custom list of common words exists in the user’s home directory, it is appended to the downloaded list, and the combined list is sorted to keep only unique entries (see the sketch after this list for the expected file format).
  4. Python Script for Word Analysis:
    • A Python script embedded within the Bash script reads the input file, tokenizes the text with the Natural Language Toolkit (nltk), and filters out the common words.
    • It then counts occurrences of the remaining words, keeping words longer than three characters that appear at least twice, and prints them as a formatted table.
  5. Output Enhancement with WordNet:
    • If WordNet (wn) is installed on the system, the script uses it to find synonyms for each overused word, collecting candidates across the word’s noun, verb, adverb, and adjective senses and keeping only the first ten unique synonyms (you can try this step by hand with the wn command shown after this list).
    • The final output includes the overused word, its count, and a list of suggested synonyms.
  6. Cleanup:
    • Temporary files created during the script’s execution are removed to clean up the working environment.
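
A couple of quick examples before the full script. For step 3, the custom word list is plain text with one word per line, the same format as the downloaded list; a hypothetical ~/common_word_list_custom_01.txt might look like this:

however
therefore
basically

For step 5, you can preview the raw wn output the script parses by running it by hand with the same search options the script uses (noun, verb, adverb, and adjective synonyms):

wn delve -synsn -synsv -synsr -synsa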

Usage:

This script is ideal for writers, editors, and anyone else looking to trim redundancy from their text or expand their vocabulary. It requires Python with nltk installed and optionally uses the WordNet command-line tool (wn) for synonym suggestions.
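
A quick setup sketch; the pip command is standard, but the WordNet package name may vary by system (the nltk punkt tokenizer data is downloaded automatically on first run):

pip3 install nltk             # required: tokenizing and word counting
sudo apt-get install wordnet  # optional: wn command on Debian/Ubuntu
brew install wordnet          # optional: wn command on macOS (Homebrew)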

Note:

To run this script, save it to a file, make it executable (chmod +x scriptname.sh), and run it by passing a text file as an argument (./scriptname.sh filename.txt).
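
For example, using the script and file names from the sample run below:

chmod +x word_frequency_analyzer.sh
./word_frequency_analyzer.sh /tmp/text.txt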

Script
#!/bin/bash
infile=$1
if [ -z "$infile" ]; then
    echo "Usage: $0 <infile>"
    exit 1
fi
if [ ! -f "$infile" ]; then
    echo "Error: $infile not found."
    exit 1
fi
# Download a list of common English words to a temporary file
common_word_list_01="$(mktemp)"
common_word_list_01_url="https://gist.githubusercontent.com/igoros777/e6ae5761ef6635c61eb9bed661f0d0c1/raw/98d35708fa344717d8eee15d11987de6c8e26d7d/1-1000.txt"
curl -m10 -k -s -o "$common_word_list_01" "$common_word_list_01_url"
if [ ! -s "$common_word_list_01" ]; then
    echo "Error: failed to download $common_word_list_01_url"
    exit 1
fi
tmpfile="$(mktemp)"

# Append the user's custom common-word list, if present, and deduplicate
common_word_list_custom_01="$HOME/common_word_list_custom_01.txt"
if [ -f "$common_word_list_custom_01" ]; then
    cat "$common_word_list_custom_01" >> "$common_word_list_01"
    sort -u -o "$common_word_list_01" "$common_word_list_01"
fi

# Tokenize the text, drop common words, and count what remains
python3 <<EOF | column -t > "${tmpfile}"
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

# Fetch the punkt tokenizer data on first use
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

with open('$infile', 'r') as f:
    text = f.read().lower()

with open('$common_word_list_01', 'r') as f:
    common_words = set(f.read().splitlines())

tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.isalpha() and word not in common_words]
counter = Counter(filtered_tokens)
overused_words = counter.most_common()

grouped_words = {}
for word, count in overused_words:
    if count in grouped_words:
        grouped_words[count].append(word)
    else:
        grouped_words[count] = [word]

# Report words longer than three characters that appear at least twice,
# most frequent first, alphabetical within each count
for count in sorted(grouped_words.keys(), reverse=True):
    for word in sorted(grouped_words[count]):
        if len(word) > 3 and count >= 2:
            print(f'{word}: {count}')
EOF

# If the WordNet CLI is available, append synonym suggestions to each word
WN="$(command -v wn)"
if [ -z "$WN" ]; then
    cat "${tmpfile}"
else
    while read -r line; do
        word=$(echo "$line" | awk -F: '{print $1}')
        # Collect synonyms across all word senses, keeping the first ten unique ones
        synonyms="$(${WN} "${word}" -synsn -synsv -synsr -synsa | 
            awk '/^Sense [0-9]+/{getline; split($0, a, " "); print a[1], a[2], a[3], a[4]}' | 
            sed 's/,$//g' | 
            awk 'BEGIN{RS=ORS="\n"} 
                { 
                    gsub(/^ +| +$/, "", $0); 
                    n = split($0, words, ", "); 
                    for (i=1; i<=n; i++) {
                        word = words[i];
                        if (!(word in seen)) {
                            seen[word]=1; 
                            print word;
                            if (++count == 10) exit;
                        }
                    }
                }' | paste -sd, - | sed 's/,/, /g')"
        echo -e "${line}\t${synonyms}"
    done < "${tmpfile}"
fi


# Clean up temporary files
/bin/rm -f "$common_word_list_01" "$tmpfile"

Sample Output
./word_frequency_analyzer.sh /tmp/text.txt

script:      13 script, book, playscript, handwriting, hand
words:       10 words, lyric, language, quarrel, wrangle, row, actor's line, speech, word, news
file:        9  file, data file, single file, Indian, file cabinet, filing, register, charge, lodge, file away
overused:    6  overuse, overdrive
synonyms:    5  synonym, equivalent word
text:        5  text, textual matter, textbook, text edition
output:      4  end product, output, yield, output signal, production, outturn, turnout
python:      4  python, Python
using:       4  exploitation, victimization, victimisation, using, use, utilize, utilise, apply, habituate, expend
input:       3  input signal, input, remark, comment, stimulation, stimulus, stimulant
uses:        3  use, usage, utilization, utilisation, function, purpose, role, consumption, economic consumption, usance
wordnet:     3  wordnet, WordNet, Princeton WordNet
analysis:    2  analysis, analytic thinking, psychoanalysis, depth psychology
bash:        2  knock, bash, bang, smash, do, brawl, sock, bop, whop, whap
created:     2  make, create, produce
download:    2  download
downloaded:  2  download
enhance:     2  enhance, heighten, raise
error:       2  mistake, error, fault, erroneousness, erroneous belief, misplay, wrongdoing, computer error
exists:      2  exist, be, survive, live, subsist
exits:       2  exit, issue, outlet, way, passing, loss, departure, go out, get, die
installed:   2  install, instal, put in, set up
nltk:        2
processes:   2  procedure, process, cognitive process, mental, summons, unconscious process, outgrowth, appendage, physical process, treat
temporary:   2  temp, temporary, temporary worker, impermanent (vs. permanent), irregular
unique:      2  alone(predicate), unique, unequaled, unequalled, unique(predicate), singular
usage:       2  use, usage, utilization, utilisation, custom, usance
vocabulary:  2  vocabulary, lexicon, mental lexicon