March 24, 2020
There was a headline on Hacker News a while back: "AI Clones Your Voice After Listening for 5 Seconds (2018) (google.github.io)". It's something called Tacotron (/täkōˌträn/) by Google. On GitHub today there are repos like ForwardTacotron (forked from WaveRNN), keithito/tacotron and Kyubyong/tacotron, but everything seems to be in Python.

These are notes from trying to convert ForwardTacotron to golang.

First thing we will need is The LJ Speech Dataset which is a public domain speech dataset consisting of 13,100 short audio clips (.WAV files) of a single speaker reading passages from 7 non-fiction books.

In the python version the first step is:
preprocess.py --path /path/to/ljspeech
Looking at the code, this is what we need to do to each file: call process_wav, which calls convert_file, which calls load_wav, which hands off to librosa. And librosa will make calls to numpy and matplotlib. Along the way there are things like the mel spectrogram, a normal spectrogram, the Fourier transform, the STFT (short-time Fourier transform) and reflective padding, which the numpy docs describe as "Pads with the reflection of the vector mirrored on the first and last values of the vector along each axis."

For example, with a pad of 3 on each side, [1,2,3] becomes [2,3,2,1,2,3,2,1,2] and [1,2,3,4] becomes [4,3,2,1,2,3,4,3,2,1].

I got to 48 commits before realizing I need to understand sound better. In each wav I see these float64 values from -1.0 to 1.0, and that's sound. But can someone break it down for me? Say the sound of a human saying "Hello" over the course of 1001 milliseconds is 0.5297476, then 0.98376515, then -0.11284, and so on. What do these numbers mean in terms of the sound my ears hear?

---INSERT MONTAGE OF HOURS OF GOOGLING
The answer is: think of a flow of numbers coming in like 0,1,2,3...7,8,9,10,9,8,7...3,2,1,0,-1,-2,-3...

So starting at zero we go up step by step until 10, then back down to 0, then down to the negatives (step by step) until -10.

-10,-9,-8,-7,...-3,-2,-1,0,1,2,3...

Over and over, this flow keeps flowing, sometimes fast (a higher note), sometimes slow (a lower note). And once you grok this, understand that this is a tuning fork vibrating a perfect-pitch note.
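To see that flow as actual numbers, here is a small Go sketch that prints samples of a pure tone. A real tuning fork traces something closer to a smooth sine curve than straight steps, so the sketch uses math.Sin. The function name sineSamples and the choice of 440 Hz (the A above middle C) are my own picks for illustration; 22050 samples per second is the LJ Speech sample rate.

```go
package main

import (
	"fmt"
	"math"
)

// sineSamples returns n samples of a pure tone at freqHz, measured
// sampleRate times per second, with every value in [-amplitude, amplitude].
// Hypothetical helper for illustration only.
func sineSamples(freqHz, sampleRate, amplitude float64, n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		t := float64(i) / sampleRate // time of this sample in seconds
		out[i] = amplitude * math.Sin(2*math.Pi*freqHz*t)
	}
	return out
}

func main() {
	// At 440 Hz and 22050 samples per second, one full up-down-up cycle
	// takes roughly 50 samples, so 60 samples shows a bit more than one.
	for i, s := range sineSamples(440, 22050, 1.0, 60) {
		fmt.Printf("%2d % .4f\n", i, s)
	}
}
```

Raising freqHz makes the numbers swing up and down faster, which the ear hears as a higher note; lowering amplitude makes the same note quieter.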

And the "steps" it takes, like from .1 to .2 or .3 to .4: wow, there are a lot of baby steps in between them, like "infinite."

0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,0.9,0.8,0.7, 0.6,0.5,0.4,0.3,0.2,0.1,0.0,-0.1,-0.2,-0.3

The above is the same example as the 1 to 10 one, but scaled to the 0.1 to 1.0 floats you will actually find in a wav.

0.0,0.1,0.15,0.2,0.3,0.4,0.5,0.51,0.52,0.6,0.7,0.8, 0.9,1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1,0.0,-0.1,-0.2,-0.3

Notice the 0.15 and the 0.51, 0.52 baby steps, and you'll see how this eventually gets to "a human saying 'Hello' over the course of 1001 milliseconds would be 0.5297476 then 0.98376515 then -0.11284".
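Where do floats like that come from in the first place? The LJ Speech clips are 16-bit PCM wavs, so on disk each sample is a whole number from -32768 to 32767, and the loader scales it down. Below is a minimal Go sketch of that scaling. It assumes little-endian 16-bit mono data with the wav header already skipped, and dividing by 32768 is one common convention rather than something I have checked against librosa. The name pcm16ToFloat is made up for these notes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pcm16ToFloat converts little-endian signed 16-bit PCM bytes into floats
// in the range [-1.0, 1.0). A trailing odd byte, if any, is ignored.
// Hypothetical helper for illustration only.
func pcm16ToFloat(data []byte) []float64 {
	out := make([]float64, 0, len(data)/2)
	for i := 0; i+1 < len(data); i += 2 {
		v := int16(binary.LittleEndian.Uint16(data[i : i+2]))
		out = append(out, float64(v)/32768.0) // assumed scaling convention
	}
	return out
}

func main() {
	// Four hand-made samples, not taken from any real recording:
	// 0, 16384, -32768 and 32767 written as little-endian int16.
	raw := []byte{0x00, 0x00, 0x00, 0x40, 0x00, 0x80, 0xff, 0x7f}
	fmt.Println(pcm16ToFloat(raw))
}
```

So the "baby steps" are not really infinite in the file: with 16 bits there are 65536 possible levels between -1.0 and 1.0.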

Sound is just this flow of numbers, fast or slow: either in perfect tuning-fork wave form (up, down, up, down, very rhythmic) or not that at all, messy and the opposite of a tuning fork, like spoken words.

Ok, I feel like I understand sound better. I can visualize an A note or a C note as these numbers flowing in. This kid is amazing. And I took the sound of his videos and a few others like this one and started to practice "looking" at their waves.

You can follow along at github.com/many-pw/tacotron.