Originally published November 7, 2020 @ 1:49 pm
Just a scripting exercise because I need to do something important, but I am procrastinating. The idea is simple: grab some URL with text containing somewhat structured data and convert it into a spreadsheet. I know, exciting…
I will use the “The World’s Most Nutritious Foods” article from Pocket and try to make a spreadsheet out of it. The first step below does these few things:
- Downloads the URL and dumps it to text format
- Removes leading spaces
- Grabs the title line for each food item and the seven subsequent lines
- Removes the numbered-list prefix from the title line
- Removes any single lower-case letter in parentheses, i.e. “(x)”
tmpfile="$(mktemp)"
tmpdir="$(mktemp -d)"
url="https://getpocket.com/explore/item/the-world-s-most-nutritious-foods"
lynx --dump "${url}" | \
  sed -r 's/^\s+//g' | \
  grep -P '^[0-9]{1,3}\.\s' -A7 --group-separator=$'ooo' | \
  sed -r 's/^[0-9]{1,3}\. //g' | \
  sed -r 's/ \([a-z]\)//g' > "${tmpfile}"
All these steps, of course, are specific to the text you’re parsing. There is no universal approach to a task such as this.
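To see what the cleanup pipeline does without hitting the network, here is a small sketch that runs the same sed/grep chain on a made-up sample. The sample text, its layout, and the shorter `-A3` context are my assumptions for illustration and not the article's real format; GNU sed and GNU grep are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical sample that imitates an indented text dump (not the real article).
sample="$(mktemp)"
cat > "${sample}" <<'EOF'
   Intro text that should be ignored.
   1. SWEET POTATO (v)
   86kcal, $0.21, per 100g
   A bright orange tuber.
   NUTRITIONAL SCORE: 49
   Unrelated paragraph.
   2. FIGS (v)
   249kcal, $0.81, per 100g
   Figs have been cultivated since ancient times.
   NUTRITIONAL SCORE: 49
EOF

# Same chain as the first step, but with -A3 because the sample items are shorter.
cleaned="$(
  sed -r 's/^\s+//' "${sample}" |
  grep -P '^[0-9]{1,3}\.\s' -A3 --group-separator='ooo' |
  sed -r 's/^[0-9]{1,3}\. //' |
  sed -r 's/ \([a-z]\)//'
)"

printf '%s\n' "${cleaned}"
rm -f "${sample}"
```

Note that grep only prints the group separator between non-adjacent groups, which is why the sample has an unrelated line between the two items.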
The second step splits the file on the “ooo” separator and removes the trailing split file (that contains nothing of interest).
cd "${tmpdir}" && csplit -k "${tmpfile}" '/^ooo/' "{$(grep -c 'ooo' "${tmpfile}")}" 2>/dev/null
/bin/rm -f "$(ls | sort -V | tail -1)"
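If you want to watch the splitting in isolation, the sketch below runs csplit on a tiny made-up input. It is a variation, not the command above: it assumes GNU csplit and uses the `{*}` repeat count (repeat as often as the pattern matches) instead of counting separators with grep, and the `part_` prefix and sample lines are hypothetical.

```shell
#!/usr/bin/env bash
workdir="$(mktemp -d)"
cd "${workdir}" || exit 1

# Hypothetical input: three items separated by "ooo" lines.
printf '%s\n' 'SWEET POTATO' '86kcal' 'ooo' 'FIGS' '249kcal' 'ooo' 'GINGER' '80kcal' > input.txt

# -s silences the byte counts, -f sets the output prefix,
# '{*}' repeats the pattern for as long as it keeps matching.
csplit -s -f part_ input.txt '/^ooo/' '{*}'

pieces="$(ls part_* | wc -l)"
first_title="$(head -1 part_00)"
second_title="$(sed -n '2p' part_01)"   # line 1 of part_01 is the "ooo" separator
echo "pieces=${pieces} first=${first_title} second=${second_title}"

cd - >/dev/null && rm -rf "${workdir}"
```

Every piece after the first starts with the separator line itself, which is why the extraction step has to filter out “ooo”.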
Finally, we read each xx* file and extract the five elements: item name, nutrition score, calories, unit price, and description. The primary approach is to use grep -P (Perl-compatible) regexes with lookaround assertions. You could also pick specific line numbers with head/tail, since their position is constant, but you would still need to parse those lines. We also prepend the column headers to the output.
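Here is a minimal sketch of the lookaround idea on a single line. The sample line is made up, and GNU grep built with PCRE support (`-P`) is assumed. A lookahead `(?=...)` or lookbehind `(?<=...)` has to match, but it is not part of what `-o` prints.

```shell
#!/usr/bin/env bash
# Hypothetical detail line; the real article's layout may differ.
line='86kcal, $0.21, per 100g'

# Digits at the start of the line, only if followed by "kcal".
calories="$(printf '%s\n' "${line}" | grep -oP '^[0-9]+(?=kcal)')"

# A decimal number, only if preceded by "$" and followed by ",".
# Single quotes stop the shell from eating the backslash in \$.
price="$(printf '%s\n' "${line}" | grep -oP '(?<=\$)[0-9]+\.[0-9]+(?=,)')"

echo "calories=${calories} price=${price}"
```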
ls xx* | while read f; do
  item="$(grep -vi ooo "${f}" | head -1 | sed 's/.*/\L&/; s/[a-z]*/\u&/g')"
  score="$(tail -1 "${f}" | awk '{print $NF}')"
  calories="$(grep -oP "(?<=^)[0-9]{1,}(?=kcal)" "${f}")"
  # Single quotes keep the backslash, so \$ matches a literal dollar sign
  price="$(grep -oP '(?<=\$)[0-9]{1,}\.[0-9]{1,}(?=,)' "${f}")"
  description="$(grep -Evi 'ooo|kcal' "${f}" | grep -oP "(?<=^).*[a-z](\.|,)?(?=$)" | tr -d '\n')"
  echo "\"${item}\",\"${score}\",\"${calories}\",\"${price}\",\"${description}\""
done | \
  (echo "\"ITEM\",\"NUTRITIONAL SCORE\",\"CALORIES/100G\",\"USD/100G\",\"DESCRIPTION\"" && cat) > ~/nutrition.csv
cd ~ && ls -als nutrition.csv
/bin/rm -rf "${tmpdir:-x}" "${tmpfile:-x}"
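Two details of the loop are easy to try on their own: the title-casing sed expression and the way a CSV row is quoted. The sketch below uses made-up values and assumes GNU sed (`\L` and `\u` are GNU extensions). Doubling embedded double quotes is an extra safety step that I am adding here; the loop above does not do it.

```shell
#!/usr/bin/env bash
# Hypothetical raw values, as they might come out of one split file.
raw_title='SWEET POTATO'
description='A "bright" orange tuber.'

# Lower-case the whole line, then upper-case the first letter of each word.
item="$(printf '%s\n' "${raw_title}" | sed 's/.*/\L&/; s/[a-z]*/\u&/g')"

# CSV convention: a double quote inside a quoted field is written twice.
escaped="$(printf '%s\n' "${description}" | sed 's/"/""/g')"

row="\"${item}\",\"49\",\"86\",\"0.21\",\"${escaped}\""
echo "${row}"
```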
The optional step is to convert the CSV to XLSX with unoconv.
unoconv -f xlsx -d spreadsheet -o ~/nutrition.xlsx ~/nutrition.csv
file ~/nutrition.xlsx
And here’s the end result.
Item | Nutritional Score | Calories/100g | USD/100g | Description |
Sweet Potato | 49 | 86 | 0.21 | A bright orange tuber, sweet potatoes are only distantly related to potatoes. They are rich in beta-carotene. |
Figs | 49 | 249 | 0.81 | Figs have been cultivated since ancient times. Eaten fresh or dried, they are rich in the mineral manganese. |
Ginger | 49 | 80 | 0.85 | Ginger contains high levels of antioxidants. In medicine, it is used as a digestive stimulant and to treat colds. |
Pumpkin | 50 | 26 | 0.2 | Pumpkins are rich in yellow and orange pigments. Especially xanthophyll esters and beta-carotene. |
Burdock Root | 50 | 72 | 1.98 | Used in folk medicine and as a vegetable, studies suggest burdock can aid fat loss and limit inflammation. |
Brussels Sprouts | 50 | 43 | 0.35 | A type of cabbage. Brussels sprouts originated in Brussels in the 1500s. They are rich in calcium and vitamin C. |
Broccoli | 50 | 34 | 0.42 | Broccoli heads consist of immature flower buds and stems. US consumption has risen five-fold in 50 years. |
Cauliflower | 50 | 31 | 0.44 | Unlike broccoli, cauliflower heads are degenerate shoot tips that are frequently white, lacking green chlorophyll. |
Water Chestnuts | 50 | 97 | 1.5 | The water chestnut is not a nut at all, but an aquatic vegetable that grows in mud underwater within marshes. |
Cantaloupe Melons | 50 | 34 | 0.27 | One of the foods richest in glutathione, an antioxidant that protects cells from toxins including free radicals. |
Prunes | 50 | 240 | 0.44 | Dried plums are very rich in health-promoting nutrients such as antioxidants and anthocyanins. |
Common Octopus | 50 | 82 | 1.5 | Though nutritious, recent evidence suggests octopus can carry harmful shellfish toxins and allergens. |
Carrots | 51 | 36 | 0.4 | Carrots first appeared in Afghanistan 1,100 years ago. Orange carrots were grown in Europe in the 1500s. |
Winter Squash | 51 | 34 | 0.24 | Unlike summer squashes, winter squashes are eaten in the mature fruit stage. The hard rind is usually not eaten. |
Jalapeno Peppers | 51 | 29 | 0.66 | The same species as other peppers. Carotenoid levels are 35 times higher in red jalapenos that have ripened. |
Rhubarb | 51 | 21 | 1.47 | Rhubarb is rich in minerals, vitamins, fibre and natural phytochemicals that have a role in maintaining health. |
Pomegranates | 51 | 83 | 1.31 | Their red and purple colour is produced by anthocyanins that have antioxidant and anti-inflammatory properties. |
Red Currants | 51 | 56 | 0.44 | Red currants are also rich in anthocyanins. White currants are the same species as red, whereas black currants differ. |
Oranges | 51 | 46 | 0.37 | Most citrus fruits grown worldwide are oranges. In many varieties, acidity declines with fruit ripeness. |
Carp | 51 | 127 | 1.4 | A high proportion of carp is protein, around 18%. Just under 6% is fat, and the fish contains zero sugar. |
Hubbard Squash | 52 | 40 | 8.77 | A variety of the species Cucurbita maxima. Tear-drop shaped, they are often cooked in lieu of pumpkins. |
Kumquats | 52 | 71 | 0.69 | An unusual citrus fruit, kumquats lack a pith inside and their tender rind is not separate like an orange peel. |
Pompano | 52 | 164 | 1.44 | Often called jacks, Florida pompanos are frequently-caught western Atlantic fish usually weighing under 2kg. |
Pink Salmon | 52 | 127 | 1.19 | These fish are rich in long-chain fatty acids, such as omega-3s, that improve blood cholesterol levels. |
Sour Cherries | 53 | 50 | 0.58 | Sour cherries (Prunus cerasus) are a different species to sweet cherries (P. avium). Usually processed or frozen. |
Rainbow Trout | 53 | 141 | 3.08 | Closely related to salmon, rainbow trout are medium-sized Pacific fish also rich in omega-3s. |
Perch | 53 | 91 | 1.54 | Pregnant and lactating women are advised not to eat perch. Though nutritious, it may contain traces of mercury. |
Green Beans | 54 | 31 | 0.28 | Green beans, known as string, snap or French beans, are rich in saponins, thought to reduce cholesterol levels. |
Red Leaf Lettuce | 54 | 16 | 1.55 | Evidence suggests lettuce was cultivated before 4500 BC. It contains almost no fat or sugar and is high in calcium. |
Leeks | 54 | 61 | 1.83 | Leeks are closely related to onions, shallots, chives and garlic. Their wild ancestor grows around the Mediterranean basin. |
Cayenne Pepper | 54 | 318 | 22.19 | Powdered cayenne pepper is produced from a unique cultivar of the pepper species Capsicum annuum. |
Green Kiwifruit | 54 | 61 | 0.22 | Kiwifruit are native to China. Missionaries took them to New Zealand in the early 1900s, where they were domesticated. |
Golden Kiwifruit | 54 | 63 | 0.22 | Kiwifruits are edible berries rich in potassium and magnesium. Some golden kiwifruits have a red centre. |
Grapefruit | 54 | 32 | 0.27 | Grapefruits (Citrus paradisi) originated in the West Indies as a hybrid of the larger pomelo fruit. |
Mackerel | 54 | 139 | 2.94 | An oily fish, one serving can provide over 10 times more beneficial fatty acids than a serving of a lean fish such as cod. |
Sockeye Salmon | 54 | 131 | 3.51 | Another oily fish, rich in cholesterol-lowering fatty acids. Canned salmon with bones is a source of calcium. |
Arugula | 55 | 25 | 0.48 | A salad leaf, known as rocket. High levels of glucosinolates protect against cancer and cardiovascular disease. |
Chives | 55 | 25 | 0.22 | Though low in energy, chives are high in vitamins A and K. The green leaves contain a range of beneficial antioxidants. |
Paprika | 55 | 282 | 1.54 | Also extracted from the pepper species Capsicum annuum. A spice rich in ascorbic acid, an antioxidant. |
Red Tomatoes | 56 | 18 | 0.15 | A low-energy, nutrient-dense food that are an excellent source of folate, potassium and vitamins A, C and E. |
Green Tomatoes | 56 | 23 | 0.33 | Fruit that has not yet ripened or turned red. Consumption of tomatoes is associated with a decreased cancer risk. |
Green Lettuce | 56 | 15 | 1.55 | The cultivated lettuce (Lactuca sativa) is related to wild lettuce (L. serriola), a common weed in the US. |
Taro Leaves | 56 | 42 | 2.19 | Young taro leaves are relatively high in protein, containing more than the commonly eaten taro root. |
Lima Beans | 56 | 106 | 0.5 | Also known as butter beans, lima beans are high in carbohydrate, protein and manganese, while low in fat. |
Eel | 56 | 184 | 2.43 | A good source of riboflavin (vitamin B2), though the skin mucus of eels can contain harmful marine toxins. |
Bluefin Tuna | 56 | 144 | 2.13 | A large fish, rich in omega-3s. Pregnant women are advised to limit their intake, due to mercury contamination. |
Coho Salmon | 56 | 146 | 0.86 | A Pacific species also known as silver salmon. Relatively high levels of fat, as well as long-chain fatty acids. |
Summer Squash | 57 | 17 | 0.22 | Harvested when immature, while the rind is still tender and edible. Its name refers to its short storage life. |
Navy Beans | 57 | 337 | 0.49 | Also known as haricot or pea beans. The fibre in navy beans has been correlated with the reduction of colon cancer. |
Plantain | 57 | 122 | 0.38 | Banana fruits with a variety of antioxidant, antimicrobial, hypoglycaemic and anti-diabetic properties. |
Podded Peas | 58 | 42 | 0.62 | Peas are an excellent source of protein, carbohydrates, dietary fibre, minerals and water-soluble vitamins. |
Cowpeas | 58 | 44 | 0.68 | Also called black-eyed peas. As with many legumes, high in carbohydrate, containing more protein than cereals. |
Butter Lettuce | 58 | 13 | 0.39 | Also known as butterhead lettuce, and including Boston and bib varieties. Few calories. Popular in Europe. |
Red Cherries | 58 | 50 | 0.33 | A raw, unprocessed and unfrozen variety of sour cherries (Prunus cerasus). Native to Europe and Asia. |
Walnuts | 58 | 619 | 3.08 | Walnuts contain sizeable proportions of a-linolenic acid, the healthy omega-3 fatty acid made by plants. |
Fresh Spinach | 59 | 23 | 0.52 | Contains more minerals and vitamins (especially vitamin A, calcium, phosphorus and iron) than many salad crops. Spinach appears twice in the list (45 and 24) because the way it is prepared affects its nutritional value. Fresh spinach can lose nutritional value if stored at room temperature, and ranks lower than eating spinach that has been frozen, for instance. |
Parsley | 59 | 36 | 0.26 | A relative of celery, parsley was popular in Greek and Roman times. High levels of a range of beneficial minerals. |
Herring | 59 | 158 | 0.65 | An Atlantic fish, among the top five most caught of all species. Rich in omega-3s, long-chain fatty acids. |
Sea Bass | 59 | 97 | 1.98 | A generic name for a number of related medium-sized oily fish species. Popular in the Mediterranean area. |
Chinese Cabbage | 60 | 13 | 0.11 | Variants of the cabbage species Brassica rapa, often called pak-choi or Chinese mustard. Low calorie. |
Cress | 60 | 32 | 4.49 | The brassica Lepidium sativum, not to be confused with watercress Nasturtium officinale. High in iron. |
Apricots | 60 | 48 | 0.36 | A ’stone’ fruit relatively high in sugar, phytoestrogens and antioxidants, including the carotenoid beta-carotene. |
Fish Roe | 60 | 134 | 0.17 | Fish eggs (roe) contain high levels of vitamin B-12 and omega-3 fatty acids. Caviar often refers to sturgeon roe. |
Whitefish | 60 | 134 | 3.67 | Species of oily freshwater fish related to salmon. Common in the northern hemisphere. Rich in omega-3s. |
Coriander | 61 | 23 | 7.63 | A herb rich in carotenoids, used to treat ills including digestive complaints, coughs, chest pains and fever. |
Romaine Lettuce | 61 | 17 | 1.55 | Also known as cos lettuce, another variety of Lactuca sativa. The fresher the leaves, the more nutritious they are. |
Mustard Leaves | 61 | 27 | 0.29 | One of the oldest recorded spices. Contains sinigrin, a chemical thought to protect against inflammation. |
Atlantic Cod | 61 | 82 | 3.18 | A large white, low fat, protein-rich fish. Cod livers are a source of fish oil rich in fatty acids and vitamin D. |
Whiting | 61 | 90 | 0.6 | Various species, but often referring to the North Atlantic fish Merlangius merlangus that is related to cod. |
Kale | 62 | 49 | 0.62 | A leafy salad plant, rich in the minerals phosphorous, iron and calcium, and vitamins such as A and C. |
Broccoli Raab | 62 | 22 | 0.66 | Not to be confused with broccoli. It has thinner stems and smaller flowers, and is related to turnips. |
Chili Peppers | 62 | 324 | 1.2 | The pungent fruits of the Capsicum plant. Rich in capsaicinoid, carotenoid and ascorbic acid antioxidants. |
Clams | 62 | 86 | 1.78 | Lean, protein-rich shellfish. Often eaten lightly cooked, though care must be taken to avoid food poisoning. |
Collards | 63 | 32 | 0.74 | Another salad leaf belonging to the Brassica genus of plants. A headless cabbage closely related to kale. |
Basil | 63 | 23 | 2.31 | A spicy, sweet herb traditionally used to protect the heart. Thought to be an antifungal and antibacterial. |
Chili Powder | 63 | 282 | 5.63 | A source of phytochemicals such as vitamin C, E and A, as well as phenolic compounds and carotenoids. |
Frozen Spinach | 64 | 29 | 1.35 | A salad crop especially high in magnesium, folate, vitamin A and the carotenoids beta carotene and zeaxanthin. Freezing spinach helps prevent the nutrients within from degrading, which is why frozen spinach ranks higher than fresh spinach. |
Dandelion Greens | 64 | 45 | 0.27 | The word dandelion means lion’s tooth. The leaves are an excellent source of vitamin A, vitamin C and calcium. |
Pink Grapefruit | 64 | 42 | 0.27 | The red flesh of pink varieties is due to the accumulation of carotenoid and lycopene pigments. |
Scallops | 64 | 69 | 4.19 | A shellfish low in fat, high in protein, fatty acids, potassium and sodium. |
Pacific Cod | 64 | 72 | 3.18 | Closely related to Atlantic cod. Its livers are a significant source of fish oil rich in fatty acids and vitamin D. |
Red Cabbage | 65 | 31 | 0.12 | Rich in vitamins. Its wild cabbage ancestor was a seaside plant of European or Mediterranean origin. |
Green Onion | 65 | 27 | 0.51 | Known as spring onions. High in copper, phosphorous and magnesium. One of the richest sources of vitamin K. |
Alaska Pollock | 65 | 92 | 3.67 | Also called walleye pollock, the species Gadus chalcogrammus is usually caught in the Bering Sea and Gulf of Alaska. A low fat content of less than 1%. |
Pike | 65 | 88 | 3.67 | A fast freshwater predatory fish. Nutritious but pregnant women must avoid, due to mercury contamination. |
Green Peas | 67 | 77 | 1.39 | Individual green peas contain high levels of phosphorous, magnesium, iron, zinc, copper and dietary fibre. |
Tangerines | 67 | 53 | 0.29 | An oblate orange citrus fruit. High in sugar and the carotenoid cryptoxanthin, a precursor to vitamin A. |
Watercress | 68 | 11 | 3.47 | Unique among vegetables, it grows in flowing water as a wild plant. Traditionally eaten to treat mineral deficiency. |
Celery Flakes | 68 | 319 | 6.1 | Celery that is dried and flaked to use as a condiment. An important source of vitamins, minerals and amino acids. |
Dried Parsley | 69 | 292 | 12.46 | Parsley that is dried and ground to use as a spice. High in boron, fluoride and calcium for healthy bones and teeth. |
Snapper | 69 | 100 | 3.75 | A family of mainly marine fish, with red snapper the best known. Nutritious but can carry dangerous toxins. |
Beet Greens | 70 | 22 | 0.48 | The leaves of beetroot vegetables. High in calcium, iron, vitamin K and B group vitamins (especially riboflavin). |
Pork Fat | 73 | 632 | 0.95 | A good source of B vitamins and minerals. Pork fat is more unsaturated and healthier than lamb or beef fat. |
Swiss Chard | 78 | 19 | 0.29 | A very rare dietary source of betalains, phytochemicals thought to have antioxidant and other health properties. |
Pumpkin Seeds | 84 | 559 | 1.6 | Including the seeds of other squashes. One of the richest plant-based sources of iron and manganese. |
Chia Seeds | 85 | 486 | 1.76 | Tiny black seeds that contain high amounts of dietary fibre, protein, a-linolenic acid, phenolic acid and vitamins. |
Flatfish | 88 | 70 | 1.15 | Sole and flounder species. Generally free from mercury and a good source of the essential nutrient vitamin B1. |
Ocean Perch | 89 | 79 | 0.82 | The Atlantic species. A deep-water fish sometimes called rockfish. High in protein, low in saturated fats. |
Cherimoya | 96 | 75 | 1.84 | Cherimoya fruit is fleshy and sweet with a white pulp. Rich in sugar and vitamins A, C, B1, B2 and potassium. |
Almonds | 97 | 579 | 0.91 | Rich in mono-unsaturated fatty acids. Promote cardiovascular health and may help with diabetes. |
