Vocaloid is a distinctive type of voice synthesizer software. Research on its technology began in 2000, and it was not initially meant to be a full commercial product. Yamaha then got involved during the development process, and the first version of Vocaloid came out in 2004.
Currently on version 5, the software synthesizes ‘singing’ through painstaking sampling and simulation of human vocal sounds.
What is Vocaloid?
Essentially, it’s software that lets users create faux ‘vocals’ by typing in lyrics and a melody. It uses a specialized form of synthesis with vocals recorded by voice actors/singers.
To make a vocal passage, the user inputs the melody and lyrics on a piano roll. You first add a note, then type the lyric onto the note itself. You can then adjust a whole slew of parameters, like the emphasis on syllables and the dynamics and tone of the ‘voice,’ and add effects like vibrato.
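The workflow above can be pictured as a list of notes, each carrying a lyric and a few expressive settings. The following is only a minimal sketch in Python: the `Note` class, its field names, and the parameter ranges are hypothetical illustrations, not Vocaloid's actual file format or API.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """One piano-roll note with its lyric (all field names are hypothetical)."""
    pitch: int                   # MIDI-style note number, e.g. 60 = middle C
    start_beat: float            # where the note begins on the piano roll
    length_beats: float          # how long the note is held
    lyric: str                   # the syllable typed onto the note
    dynamics: int = 64           # assumed 0-127 loudness setting
    vibrato_depth: float = 0.0   # assumed 0.0-1.0 vibrato amount


def passage_length(notes):
    """Return the beat at which the last note ends (0.0 for an empty passage)."""
    return max((n.start_beat + n.length_beats for n in notes), default=0.0)


# A three-note passage: add each note, then attach its lyric and settings.
passage = [
    Note(pitch=60, start_beat=0.0, length_beats=1.0, lyric="hel"),
    Note(pitch=62, start_beat=1.0, length_beats=1.0, lyric="lo"),
    Note(pitch=64, start_beat=2.0, length_beats=2.0, lyric="world",
         vibrato_depth=0.5),
]

print(passage_length(passage))
```

Keeping the lyric on the note itself mirrors the piano-roll idea: the melody and the words stay aligned because they live in the same record.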
A Singer in a Box
Over the years, many different voice banks have been released for use with the Vocaloid synthesizer. You can think of these like sample packs. Each is sold as “a singer in a box” — a replacement for an actual singer.
These virtual singers are examples of moe anthropomorphism, a practice from Japanese manga and anime in which inanimate objects are given a face and personality. Moe is a Japanese slang term for feelings of affection or adoration, often used by fans to describe characters in anime or manga.
The avatars from the sample packs, also called Vocaloids, are sold as ‘real’ characters. Some have even been featured on stage via projection.
This niche software is built for professional musicians/producers as well as casual users.
Japanese artists such as the electro group Livetune and J-pop band Supercell have used Vocaloid for the vocals in their music. The Japanese record label Exit Tunes has also released compilation albums featuring Vocaloids (the avatars) as performing artists.
Other artists, such as British progressive musician Mike Oldfield, have used Vocaloids as background ‘singers.’
Since the software doesn’t synthesize sounds from nothing, it requires recordings of native speakers of each language.
Vocaloid originally launched with English and Japanese voices only. The first English avatars were Leon, Lola, and Miriam. The first Japanese avatars were Meiko and Kaito.
Vocaloid 3 added Spanish with the Vocaloids Bruno, Clara, and Maika; Chinese with Luo Tianyi, Xin Hua, and YANHE; and Korean with SeeU.
How It Works
The software’s technology falls into the concatenative synthesis category. This approach builds its output from short samples of recorded sounds and is very common in speech synthesis.
Vocaloid’s concatenative synthesis works in the frequency domain: it splices together vocal fragments taken from human singing voices. The system can produce semi-realistic voices by adding expressions like vibrato and dynamic emphasis.
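To make the splicing idea concrete, here is a toy sketch in Python. It is not Vocaloid's engine: it works in the time domain rather than the frequency domain, it uses generated sine tones as stand-ins for recorded vocal fragments, and every name and value in it (`SAMPLE_RATE`, `tone`, `tone_with_vibrato`, `splice`, the vibrato rate and depth) is an assumption chosen for illustration.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate, kept low so the example stays small


def tone(freq_hz, n_samples):
    """Stand-in for a recorded fragment: a plain sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


def tone_with_vibrato(freq_hz, n_samples, rate_hz=5.5, depth=0.02):
    """Sine tone whose pitch wobbles by +/- `depth` (as a fraction) at `rate_hz`."""
    phase = 0.0
    out = []
    for i in range(n_samples):
        wobble = math.sin(2 * math.pi * rate_hz * i / SAMPLE_RATE)
        phase += 2 * math.pi * freq_hz * (1 + depth * wobble) / SAMPLE_RATE
        out.append(math.sin(phase))
    return out


def splice(fragments, overlap):
    """Concatenate fragments, linearly crossfading `overlap` samples per joint.

    Assumes every fragment is at least `overlap` samples long.
    """
    out = list(fragments[0])
    for frag in fragments[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)   # weight of the incoming fragment
            j = len(out) - overlap + i    # matching position in the output tail
            out[j] = out[j] * (1 - w) + frag[i] * w
        out.extend(frag[overlap:])
    return out


# Two "fragments": a steady note followed by a higher note with vibrato.
voice = splice([tone(220, 800), tone_with_vibrato(330, 800)], overlap=100)
print(len(voice))
```

The crossfade is the simplest possible way to hide a joint between two fragments. A real singing synthesizer has to do far more work at each joint, which is part of why convincing results are hard to achieve.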
When Vocaloid launched to the public in 2004, its developers called the technology “Frequency-domain Singing Articulation Splicing and Shaping.” As of version 2 in 2007, they dropped the lengthy descriptor. Understandably so, since it may have raised more questions than it answered.
Even though the developers explained ‘Singing Articulation’ as ‘vocal expressions’ like vibrato and any vocal fragments necessary for singing, it isn’t immediately clear how the software works. It’s easiest to think of it as a form of synthesis specifically for replicating human singing voices.
The Vocaloid and Vocaloid 2 synthesis engines were deliberately designed for singing, not text-to-speech. Other software, such as Vocaloid-flex and Voiceroid, was developed for that purpose.
The synthesis isn’t perfect, of course. For instance, the engines can’t naturally recreate certain vocal styles, such as hoarse singing or screaming/shouting. So, there aren’t any grindcore Vocaloids (yet).
Despite its long lifespan, this software doesn’t appear to have broken into mainstream music production. It’s seen some popularity and success in Japan, but hasn’t yet made much of an impact on Western markets.
This could be due to any number of factors, though the unnatural quality of the voices is probably the biggest contributor. Vocaloid is undoubtedly impressive, but it remains incredibly difficult to simulate the nuances of the human voice in a convincing and natural way.