Tips for Writing a Book

しむどん： 2025-07-10

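Working on books means writing and reading a lot of text. While reading manuscripts during production, I find that problems that are hard to notice when reading silently become obvious when I hear the text read aloud. So when I read a manuscript, I have the computer read it aloud and follow along by ear while my eyes track the words. Naturally, the better the read-aloud quality, the easier the work goes.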

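To improve the read-aloud quality, I strip the markup notation from the manuscript and convert it into natural text. For manuscripts written in a markup language such as AsciiDoc or Markdown, I convert them with the following steps.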

1. Convert the AsciiDoc source to XML.
2. Remove unneeded information (image tags, links, code blocks) from the XML and convert it to clean text.
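The second step can be sketched with the Python standard library alone. This is a minimal illustration, not the script used in this post: the function names, the set of skipped elements, and the sample DocBook fragment are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Elements whose content should not be read aloud.
# This list is an assumption for the sketch; adjust it to the manuscript.
SKIPPED = {"programlisting", "literallayout", "screen",
           "mediaobject", "inlinemediaobject"}


def local_name(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def collect_text(element, parts):
    """Append readable text under element to parts, skipping SKIPPED subtrees."""
    if local_name(element.tag) in SKIPPED:
        return
    if element.text:
        parts.append(element.text)
    for child in element:
        collect_text(child, parts)
        # Tail text belongs to the parent, so keep it even if the child was skipped.
        if child.tail:
            parts.append(child.tail)


def docbook_to_text(xml_source):
    """Return the readable text of a DocBook string with whitespace collapsed."""
    root = ET.fromstring(xml_source)
    parts = []
    collect_text(root, parts)
    return " ".join("".join(parts).split())


# Hypothetical DocBook 5 fragment used only for this demonstration.
sample = """<article xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sample</title>
  <simpara>See <link xlink:href="https://example.com">the site</link> for details.</simpara>
  <programlisting>print("hidden")</programlisting>
  <simpara>Done.</simpara>
</article>"""

print(docbook_to_text(sample))
# → Sample See the site for details. Done.
```

The link text is kept while its URL attribute is dropped, and the code listing disappears entirely, which is the behavior wanted for text-to-speech.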

I wrote a Dockerfile and two scripts to run this process. The files are listed below.

Dockerfile

# image: asciidoc2text
# build: docker build -t asciidoc2text .
FROM ruby:3.2.0-alpine

RUN apk add --no-cache build-base python3 py3-pip && \
  gem install --no-document asciidoctor && \
  pip install beautifulsoup4 && \
  apk del build-base

COPY asciidoc2text /usr/local/bin/
COPY docbook5totext.py /usr/local/bin/

# Make both scripts executable for any user in the container.
RUN chmod 755 /usr/local/bin/asciidoc2text /usr/local/bin/docbook5totext.py

CMD ["asciidoc2text"]

docbook5totext.py

#!/usr/bin/env python3
import re
import sys
import warnings

from bs4 import BeautifulSoup, XMLParsedAsHTMLWarning

warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

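# Path to the DocBook XML file produced by asciidoctor.
input_filename = sys.argv[1]

# Read the file as UTF-8 regardless of the container's locale.
with open(input_filename, encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")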

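# Replace anchor elements with their plain text, if any are present.
for a in soup.find_all('a'):
    a.replace_with(a.get_text())

# Drop code listings entirely; they are not useful when read aloud.
for code_block in soup.find_all(['pre', 'code', 'programlisting', 'literallayout']):
    code_block.decompose()

text = soup.get_text(separator=' ', strip=True)
# Read footnote markers such as [fn:1] as the word '注釈' (Japanese for "note").
text = re.sub(r'\[fn:[^\]]+\]', '注釈', text)
# Strip URL schemes so the speech engine does not spell them out.
text = re.sub(r'https?://', '', text)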

print(text)
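The two regular-expression substitutions at the end of the script can be tried in isolation. The sketch below wraps them in a helper; the name `tidy_for_speech`, its `footnote_word` parameter, and the sample sentence are hypothetical and exist only for this demonstration.

```python
import re


def tidy_for_speech(text, footnote_word="注釈"):
    """Apply the same two substitutions as the script above."""
    # Replace footnote markers such as [fn:1] with a spoken word.
    text = re.sub(r"\[fn:[^\]]+\]", footnote_word, text)
    # Remove the URL scheme so only the host and path are read aloud.
    text = re.sub(r"https?://", "", text)
    return text


print(tidy_for_speech("See the manual[fn:1] at https://example.com/docs"))
# → See the manual注釈 at example.com/docs
```

Only the scheme is removed, so the rest of the URL is still read out. If that is too noisy, a broader pattern could drop the whole URL instead.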

asciidoc2text

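#!/usr/bin/env sh
set -e -x
# Convert the AsciiDoc file given as the first argument to DocBook 5 XML.
asciidoctor -b docbook5 "$1" -o _temp.xml
# Strip the markup and print the plain text.
exec docbook5totext.py _temp.xml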