Trying out speech synthesis with voicevox

しむどん： 2023-01-21

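The other day I used ChatGPT to extend Emacs's doctor¹. The quality of the conversation improved dramatically, and it became a great conversation partner. After talking to it many times, I started wanting to hear its voice. So I decided to make it speak.

ChatGPT replies in text, so all I need is a way to convert text into audio; in other words, a text-to-speech feature. These are convenient times: such things are available both as SaaS APIs and as programs that run locally. On top of that, many of them are cheap or free and can be used right away.

I had previously implemented read-aloud in Emacs Lisp using OpenJTalk². That would have been fine, but I searched the internet again to see which option was best, other approaches included, and a tool called voicevox looked like a good fit for my needs. So this time I decided to try voicevox.

Starting the API with docker

voicevox is OSS and also provides Docker images. For now I only want to get a grasp of basic voicevox usage, so I use the provided Docker image as is.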

docker pull voicevox/voicevox_engine:cpu-ubuntu20.04-latest

Pulling the Docker image

Once the image has been pulled, start a Docker container.

docker run --rm -it -p '127.0.0.1:50021:50021' voicevox/voicevox_engine:cpu-ubuntu20.04-latest

Starting the Docker container

Once the container is running, it accepts Web API requests.

Synthesizing speech

First, fetch the parameters needed for synthesis.

POST http://localhost:50021/audio_query?speaker=1&text=こんにちわ
Content-Type: application/json

{
  "accent_phrases": [
    {
      "moras": [
        {
          "text": "コ",
          "consonant": "k",
          "consonant_length": 0.10002632439136505,
          "vowel": "o",
          "vowel_length": 0.15740256011486053,
          "pitch": 5.714912414550781
        },
        {
          "text": "ン",
          "consonant": null,
          "consonant_length": null,
          "vowel": "N",
          "vowel_length": 0.08265873789787292,
          "pitch": 5.8854217529296875
        },
        {
          "text": "ニ",
          "consonant": "n",
          "consonant_length": 0.03657080978155136,
          "vowel": "i",
          "vowel_length": 0.117112897336483,
          "pitch": 5.998487949371338
        },
        {
          "text": "チ",
          "consonant": "ch",
          "consonant_length": 0.08808862417936325,
          "vowel": "i",
          "vowel_length": 0.09015568345785141,
          "pitch": 5.977110385894775
        },
        {
          "text": "ワ",
          "consonant": "w",
          "consonant_length": 0.08290570229291916,
          "vowel": "a",
          "vowel_length": 0.2083434909582138,
          "pitch": 6.048254013061523
        }
      ],
      "accent": 5,
      "pause_mora": null,
      "is_interrogative": false
    }
  ],
  "speedScale": 1.0,
  "pitchScale": 0.0,
  "intonationScale": 1.0,
  "volumeScale": 1.0,
  "prePhonemeLength": 0.1,
  "postPhonemeLength": 0.1,
  "outputSamplingRate": 24000,
  "outputStereo": false,
  "kana": "コンニチワ'"
}
// POST http://localhost:50021/audio_query?speaker=1&text=こんにちわ
// HTTP/1.1 200 OK
// date: Mon, 16 Jan 2023 22:17:10 GMT
// server: uvicorn
// content-length: 981
// content-type: application/json
// Request duration: 0.053660s
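The response above is plain JSON, so it can be inspected before it is sent on to synthesis. The following is a minimal sketch that rebuilds the reading from the moras and estimates the utterance length. The helper names (iter_moras, reading, estimated_seconds) are hypothetical, the sample data is a shortened stand-in for the real response, and the duration formula (sum of the phoneme lengths plus the pre/post padding, divided by speedScale) is my assumption about how the fields combine, not something the API documentation is quoted on here.

```python
import json

# A shortened query in the same shape as the response above (values rounded).
SAMPLE_QUERY = json.loads("""
{
  "accent_phrases": [
    {"moras": [
       {"text": "コ", "consonant": "k", "consonant_length": 0.1,
        "vowel": "o", "vowel_length": 0.15, "pitch": 5.7},
       {"text": "ン", "consonant": null, "consonant_length": null,
        "vowel": "N", "vowel_length": 0.08, "pitch": 5.9}
     ],
     "accent": 2, "pause_mora": null, "is_interrogative": false}
  ],
  "speedScale": 1.0,
  "prePhonemeLength": 0.1,
  "postPhonemeLength": 0.1
}
""")


def iter_moras(query):
    """Yield every mora in the query, including pause moras when present."""
    for phrase in query["accent_phrases"]:
        yield from phrase["moras"]
        if phrase.get("pause_mora"):
            yield phrase["pause_mora"]


def reading(query):
    """Concatenate the katakana text of all moras."""
    return "".join(mora["text"] for mora in iter_moras(query))


def estimated_seconds(query):
    """Estimate the utterance length (assumed formula, see the lead-in)."""
    total = query["prePhonemeLength"] + query["postPhonemeLength"]
    for mora in iter_moras(query):
        # consonant_length is null for moras without a consonant.
        total += (mora["consonant_length"] or 0.0) + mora["vowel_length"]
    return total / query["speedScale"]


print(reading(SAMPLE_QUERY))
print(round(estimated_seconds(SAMPLE_QUERY), 2))
```

A check like this is handy for confirming that the engine read the text the way you expected before spending time on synthesis.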

Use the fetched parameters to synthesize the speech and obtain a wav file.

POST http://localhost:50021/synthesis?speaker=1
Content-Type: application/json

{
  "accent_phrases": [
    {
      "moras": [
        {
          "text": "コ",
          "consonant": "k",
          "consonant_length": 0.10002632439136505,
          "vowel": "o",
          "vowel_length": 0.15740256011486053,
          "pitch": 5.714912414550781
        },
        {
          "text": "ン",
          "consonant": null,
          "consonant_length": null,
          "vowel": "N",
          "vowel_length": 0.08265873789787292,
          "pitch": 5.8854217529296875
        },
        {
          "text": "ニ",
          "consonant": "n",
          "consonant_length": 0.03657080978155136,
          "vowel": "i",
          "vowel_length": 0.117112897336483,
          "pitch": 5.998487949371338
        },
        {
          "text": "チ",
          "consonant": "ch",
          "consonant_length": 0.08808862417936325,
          "vowel": "i",
          "vowel_length": 0.09015568345785141,
          "pitch": 5.977110385894775
        },
        {
          "text": "ワ",
          "consonant": "w",
          "consonant_length": 0.08290570229291916,
          "vowel": "a",
          "vowel_length": 0.2083434909582138,
          "pitch": 6.048254013061523
        }
      ],
      "accent": 5,
      "pause_mora": null,
      "is_interrogative": false
    }
  ],
  "speedScale": 1.0,
  "pitchScale": 0.0,
  "intonationScale": 1.0,
  "volumeScale": 1.0,
  "prePhonemeLength": 0.1,
  "postPhonemeLength": 0.1,
  "outputSamplingRate": 24000,
  "outputStereo": false,
  "kana": "コンニチワ'"
}

a.wav
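The two requests above can be chained from a script. Here is a minimal sketch using only the Python standard library; it assumes the container from the previous section is listening on localhost:50021, and the function names (build_url, post, synthesize_to_file) are hypothetical helpers, not part of voicevox.

```python
import urllib.parse
import urllib.request

# Address published by the docker run command shown earlier.
BASE_URL = "http://localhost:50021"


def build_url(path, **params):
    """Return BASE_URL + path with params percent-encoded as a query string."""
    return f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"


def post(url, body=b""):
    """POST body to url as JSON and return the raw response bytes."""
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def synthesize_to_file(text, wav_path, speaker=1):
    """Run audio_query, feed its response to synthesis, and save the wav."""
    query = post(build_url("/audio_query", speaker=speaker, text=text))
    wav = post(build_url("/synthesis", speaker=speaker), body=query)
    with open(wav_path, "wb") as fp:
        fp.write(wav)
```

With the container running, calling synthesize_to_file("こんにちわ", "a.wav") should leave a playable a.wav in the current directory. The audio_query response is passed through untouched, which mirrors the manual steps above.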

Playback

The wav file can be played any way you like; I am a macOS user, so I play it with the afplay command.

afplay a.wav

Text-to-speech from Emacs

I implemented an extension for using voicevox from Emacs.

;;; voicevox --- Voicevox utility for Emacs.

;; Copyright (C) 2024 TakesxiSximada

;; Author: TakesxiSximada
;; Maintainer: TakesxiSximada
;; Version: 1.0
;; Package-Version: 20230116.0000
;; Package-Requires: ((emacs "29.1"))
;; Date: 2023-01-16

;; This file is not part of GNU Emacs.

;;; License:

;; This program is free software: you can redistribute it and/or
;; modify it under the terms of the GNU Affero General Public License as
;; published by the Free Software Foundation, either version 3 of the
;; License, or (at your option) any later version.

;; This program is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
;; Affero General Public License for more details.

;; You should have received a copy of the GNU Affero General Public
;; License along with this program.  If not, see
;; <https://www.gnu.org/licenses/>.

;;; Commentary:

;; Voicevox Emacs Integration.

;;; Code:

(require 'plz)

(defvar voicevox-audio-file "test.wav")
(defvar voicevox-current-sentense "こんにちわ")

(defun voicevox-set (&optional sentence)
  (interactive "s")
  (setq voicevox-current-sentense sentence))

(defun voicevox-set-region (&optional beg end)
  (interactive "r")
  (setq voicevox-current-sentense (buffer-substring-no-properties beg end)))

(defun voicevox-play ()
  (interactive)
  (voicevox-cleint-fetch-audio-query))
;-------------------------------------------------------------------

(defvar voicevox-server-buffer-name "*VOICEVOX SERVER*")
(defvar voicevox-server-command
  '("docker" "run" "--rm" "-it" "-p" "127.0.0.1:50021:50021" "voicevox/voicevox_engine:cpu-ubuntu20.04-latest"))

(defvar voicevox-server-stop-signal-code 15)  ;; SIGTERM
(defun voicevox-server-start ()
  (interactive)
  (make-process :name "VOICEVOX SERVER"
		:buffer voicevox-server-buffer-name
		:command voicevox-server-command))


(defun voicevox-server-stop ()
  (interactive)
  (signal-process
   (get-buffer-process (get-buffer voicevox-server-buffer-name))
   voicevox-server-stop-signal-code))


;-------------------------------------------------------------------

(require 'plz)
(require 'json)

(defvar voicevox-client-request-synthesis-param nil)
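(defvar voicevox-client-fetch-audio-query-success-hook nil
  "Hook run after the audio query has been fetched.")
(defvar voicevox-client-fetch-synthesis-success-hook nil
  "Hook run after the synthesized audio file has been fetched.")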

(defun voicevox-cleint-fetch-audio-query ()
  (interactive)
  (plz 'post (format
	      "http://localhost:50021/audio_query?speaker=1&text=%s"
	      (url-hexify-string voicevox-current-sentense))
    :headers '(("Content-Type" . "application/json"))
    :body ""
    :then (lambda (d)
	    (setq voicevox-client-request-synthesis-param d)
	    (run-hooks 'voicevox-client-fetch-audio-query-success-hook))))


(defun voicevox-cleint-fetch-audio-file ()
  (interactive)
  (let ((plz-curl-default-args
	 (append `("-o" ,voicevox-audio-file) plz-curl-default-args)))
    (plz 'post "http://localhost:50021/synthesis?speaker=1"
      :headers '(("Content-Type" . "application/json"))
      :body voicevox-client-request-synthesis-param
      :then (lambda (d)
	      (run-hooks 'voicevox-client-fetch-synthesis-success-hook)))))

;-------------------------------------------------------------------
(defvar voicevox-afplay-executable "afplay")
(defvar voicevox-afplay-buffer-name "*VOICEVOX AFPLAY*")

(defun voicevox-afplay ()
  (interactive)
  (make-process :name "VOICEVOX"
		:buffer voicevox-afplay-buffer-name
		:command `(,voicevox-afplay-executable ,voicevox-audio-file)))

;-------------------------------------------------------------------
(add-hook 'voicevox-client-fetch-audio-query-success-hook 'voicevox-cleint-fetch-audio-file)
(add-hook 'voicevox-client-fetch-synthesis-success-hook 'voicevox-afplay)

;;;###autoload
(defvar voicevox-core-process nil)

;;;###autoload
(defun voicevox-core-start ()
  "Start the process that speaks using voicevox_core."
  (interactive)

  (unless (process-live-p voicevox-core-process) ; start only when the process is not running
    (setq voicevox-core-process
	  (start-process "*VOICEVOX*" (get-buffer-create "*VOICEVOX*")
			 (expand-file-name "~/.cache/python-venvs/speech/bin/python")
			 (expand-file-name "~/ng/symdon/articles/posts/1673875278/speech1.py")))))

;;;###autoload
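(defun voicevox-say-on-region ()
  "Speak the text of the region."
  (interactive)
  (process-send-string voicevox-core-process
		       (concat (buffer-substring-no-properties (region-beginning) (region-end)) "\n"))
  ;; speech1.py reads until end-of-file, so mark the end of the utterance.
  ;; On a pty (the default for `start-process') this sends C-d.
  (process-send-eof voicevox-core-process))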

(provide 'voicevox)
;; voicevox.el ends here

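Looking into voicevox_core

voicevox is built from three components: the editor, the engine, and the core. Of these, the core is the part that actually generates the audio, and its code is maintained in the voicevox_core repository. Here I use voicevox_core directly.

voicevox_core is implemented in Rust; building it produces a shared library (.so, .dylib, .dll) along with the accompanying language bindings.

Get the source code from https://github.com/VOICEVOX/voicevox_core.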

git clone https://github.com/VOICEVOX/voicevox_core.git
cd voicevox_core

Build with cargo.

cargo build

This also produces the shared library and a whl, which is the Python package. Create a Python venv and install the wheel into it.

python3 -m venv .venv
source .venv/bin/activate
pip install ./target/wheels/voicevox_core-0.0.0-cp38-abi3-macosx_11_0_arm64.whl

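The voicevox_core releases page offers a downloader program. I will not use most of the data it downloads here, but I did decide to use the OpenJTalk dictionary data, open_jtalk_dic_utf_8-1.11.

For the voice model, I use the sample data bundled in the repository. It apparently has to be a ZIP file, so archive it beforehand.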

cd model/sample.vvm
zip ../../sample.vvm.zip ./*
cd ../../

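With these in place, try generating an audio file with a simple Python program.

import os

from voicevox_core import AccelerationMode
from voicevox_core.blocking import Onnxruntime, OpenJtalk, Synthesizer, VoiceModelFile

onnxruntime = Onnxruntime.load_once(filename="./target/release/libonnxruntime.dylib")
# "~" is not expanded automatically, so expand it explicitly.
openjtalk = OpenJtalk(os.path.expanduser("~/open_jtalk_dic_utf_8-1.11"))
synthesizer = Synthesizer(onnxruntime, openjtalk, AccelerationMode.CPU)
model = VoiceModelFile.open("./sample.vvm.zip")
synthesizer.load_voice_model(model)
audio_query = synthesizer.audio_query("こんにちわ", style_id=0)
wav = synthesizer.synthesis(audio_query, style_id=0)

# Write the synthesized WAV bytes to a file.
with open("a.wav", "wb") as fp:
    fp.write(wav)

Playing a.wav now produces the speech.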

Setting up the Python environment

Set up the environment.

python3 -m venv .venv
source .venv/bin/activate

pip install sounddevice ~/ng/voicevox_core/target/wheels/voicevox_core-0.0.0-cp38-abi3-macosx_11_0_arm64.whl

Playing audio data

There are several ways to play audio data. Here I learn how to play WAV data with Python's sounddevice. First, install the sounddevice package from PyPI.

pip install sounddevice numpy

Once it is installed correctly, sounddevice can be imported in Python.

import sounddevice

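Take some WAV file and try playing its data.

import os

# "~" is not expanded by open(), so expand it explicitly.
with open(os.path.expanduser("~/ng/voicevox_core/downloads/voicevox_core/a.wav"), "rb") as fp:
    wav_data = fp.read()

Next, convert wav_data into a form that sounddevice can handle. wav_data is a byte string, but WAV audio is easier to work with as an array of 16-bit integers, so dtype is set to numpy.int16. count is len(wav_data)//2 so that the bytes are read 16 bits (that is, 2 bytes) at a time. After that, each value is normalized into the range -1.0 to 1.0. Note that data read from a file still begins with the WAV header, which this conversion also interprets as a handful of samples.

16-bit integers normally range from -32768 to 32767; to fit them into floating-point numbers between -1.0 and 1.0, the code computes data_s16 * 0.5**15. This is the same as dividing each value by 32768.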

import numpy as np

data_s16 = np.frombuffer(wav_data, dtype=np.int16, count=len(wav_data)//2, offset=0)
float_data = data_s16 * 0.5**15

Play the converted data with sounddevice.play().

sounddevice.play(float_data, samplerate=44100, blocking=True)
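The conversion above works on the raw bytes of the file and hard-codes the sample rate, while the audio query shown earlier reports an outputSamplingRate of 24000. As an alternative, here is a minimal sketch that lets Python's standard wave module strip the header and report the rate stored in the file. It assumes 16-bit mono PCM, uses only the standard library (a plain list instead of a numpy array) so it can run anywhere, and decode_wav and make_test_wav are hypothetical helper names.

```python
import array
import io
import sys
import wave


def decode_wav(wav_bytes):
    """Return (samples, sample_rate) for 16-bit mono PCM WAV bytes.

    samples is a list of floats, each raw value divided by 32768.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())  # header is not included
    pcm = array.array("h")
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [value / 32768 for value in pcm], sample_rate


def make_test_wav(values, sample_rate):
    """Build a small in-memory WAV so the example runs without any files."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        pcm = array.array("h", values)
        if sys.byteorder == "big":
            pcm.byteswap()
        wav.writeframes(pcm.tobytes())
    return buffer.getvalue()


wav_bytes = make_test_wav([0, 16384, -32768, 32767], 24000)
samples, rate = decode_wav(wav_bytes)
print(rate, samples)
```

The decoded samples and the detected rate could then be handed to sounddevice.play() in place of the hard-coded value, which avoids playing the audio at the wrong speed when the file's rate differs.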

Tidying it up for use

import os

import sounddevice as sd
import numpy as np
from voicevox_core import AccelerationMode, AudioQuery, wav_from_s16le
from voicevox_core.blocking import Onnxruntime, OpenJtalk, Synthesizer, VoiceModelFile

# "~" is not expanded automatically, so expand it explicitly.
onnxruntime_path = os.path.expanduser("~/ng/voicevox_core/target/release/libonnxruntime.dylib")
openjtalk_dict_path = os.path.expanduser("~/ng/voicevox_core/downloads/voicevox_core/open_jtalk_dic_utf_8-1.11")
vmm_path = os.path.expanduser("~/ng/voicevox_core/model/sample.vvm.zip")

onnxruntime = Onnxruntime.load_once(filename=onnxruntime_path)
openjtalk = OpenJtalk(openjtalk_dict_path)
synthesizer = Synthesizer(onnxruntime, openjtalk, AccelerationMode.CPU)

model = VoiceModelFile.open(vmm_path)
synthesizer.load_voice_model(model)

text = """こんにちわ。僕はしむどんです。"""

audio_query = synthesizer.audio_query(text, style_id=0)
wav_data = synthesizer.synthesis(audio_query, style_id=0)
data_s16 = np.frombuffer(
    wav_data, dtype=np.int16, count=len(wav_data)//2, offset=0)
float_data = data_s16 * 0.5**15

sd.play(float_data, samplerate=44100, blocking=True)

Implementing a script that receives the text to speak on standard input

speech1.py

#! /usr/bin/env python
"""
Convert text to audio data with voicevox_core and speak it.

Speaks the text passed on standard input.
"""
import os
import sys
import sounddevice as sd
import numpy as np
from voicevox_core import AccelerationMode, AudioQuery, wav_from_s16le
from voicevox_core.blocking import Onnxruntime, OpenJtalk, Synthesizer, VoiceModelFile

__version__ = "1.0.0"
__author__ = "TakesxiSximada"

def _expand(path):
    return os.path.abspath(os.path.expanduser(path))

onnxruntime_path = _expand("~/ng/voicevox_core/target/release/libonnxruntime.dylib")
openjtalk_dict_path = _expand("~/ng/voicevox_core/downloads/voicevox_core/open_jtalk_dic_utf_8-1.11")
# vmm_path = _expand("~/ng/voicevox_core/model/sample.vvm.zip")
vmm_path = _expand("~/ng/symdon/articles/posts/1673875278/1.vvm")


def main():
    onnxruntime = Onnxruntime.load_once(filename=onnxruntime_path)
    openjtalk = OpenJtalk(openjtalk_dict_path)

    with Synthesizer(onnxruntime, openjtalk, AccelerationMode.CPU) as synthesizer:
        with VoiceModelFile.open(vmm_path) as model:
            synthesizer.load_voice_model(model)

            while True:
                # read() blocks until end-of-file, so the sender has to
                # signal EOF after each utterance.
                text = sys.stdin.read()

                audio_query = synthesizer.audio_query(text, style_id=1)
                wav_data = synthesizer.synthesis(audio_query, style_id=1)
                data_s16 = np.frombuffer(
                    wav_data, dtype=np.int16, count=len(wav_data)//2, offset=0)
                float_data = data_s16 * 0.5**15

                sd.play(float_data, samplerate=44100, blocking=True)


if __name__ == "__main__":
    main()
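The loop in speech1.py reads until end-of-file, so it depends on the sender signalling EOF after every utterance. One alternative, shown here only as a sketch and not as what the script above does, is a line-per-utterance protocol: each line of input is spoken as soon as it arrives. The name speak_lines is hypothetical, and the speak callback stands in for the audio_query, synthesis and playback steps so that the example runs without voicevox_core installed.

```python
import io


def speak_lines(stream, speak):
    """Call speak(text) for each non-empty line of stream; return the count."""
    count = 0
    for line in stream:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of synthesizing silence
        speak(text)
        count += 1
    return count


# Demonstration with an in-memory stream and a list standing in for the
# synthesizer; in a real script, stream would be sys.stdin.
spoken = []
speak_lines(io.StringIO("こんにちわ\n\nおはよう\n"), spoken.append)
print(spoken)
```

With this design the process ends cleanly when its input is closed, at the cost of not being able to send a multi-line text as a single utterance.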

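Making it usable from Emacs

voicevox.el

voicevox.el is the file listed in full earlier in this post. Its voicevox-core-start command launches speech1.py as a subprocess, and voicevox-say-on-region sends the text of the region to that process.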


¹ Making doctor, Emacs's conversational therapy feature, work with ChatGPT

² Emacs Lisp for reading text aloud