M&E Journal / M&E Daily, June 13, 2017
M&E Journal: Q&A: Entertainment Localization Technologies
By Scott Rose, CTO, SDI Media Group
Abstract: For years, the localization industry has witnessed significant advancements in technology designed to replace human effort for voice, text, translation and timing. The fact is, for premium entertainment content, the benefit is often as elusive as it is inconsistent. Automation works best in a manufacturing model where the inputs, processes and outputs are repeatable. The localization workflow is repeatable; however, the content in the container coming in the door is a random, noisy, one-off assortment of artistic choices. To use these technologies effectively, we must bake in artificial intelligence that analyzes on the fly and fit-scores not only what the content is, but which technology and workflow are most suitable, including what artistic constraints exist.
From the outside looking in, the tasks for localizing premium content (series and movies) are fairly simple. For subtitling, you translate the dialog as timeline events that appear on the screen. For voice audio replacement (dubbing), you translate the text, bring actors into a studio to record the script timed to the dialog as lip-sync or voice over, and deliver like any other audio track.
If you are doing multi-language localization, it is common to transcribe the original language and create a template for translation. You may also need to represent text on screen as localized forced narratives or capture it in the dub. By and large, this is a human process, with various systems to do the heavy lifting, such as managing the process and resources or creating the deliverables. At a high level, it looks like this.
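The timed-text template at the heart of this workflow can be sketched as a simple data structure. This is an illustrative sketch only; the field names and example events are assumptions, not any particular vendor's format.

```python
from dataclasses import dataclass

@dataclass
class TemplateEvent:
    """One timed-text event in a language-neutral template."""
    index: int
    start: float           # seconds from program start
    end: float             # seconds from program start
    source_text: str       # transcribed original-language dialog
    translation: str = ""  # filled in per target language

# A translator works through events like these, one per dialog line.
template = [
    TemplateEvent(1, 12.0, 14.5, "Where have you been?"),
    TemplateEvent(2, 14.8, 17.2, "You wouldn't believe me if I told you."),
]
```

The same template feeds both subtitling (translations become on-screen events) and dubbing (translations become the recording script, timed to the original dialog).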
Are there technologies that can completely automate every one of these human tasks?
Yes, ideally. Voice recognition and voice-to-text can create a transcription. Text alignment and voice recognition can create a timed-text template. Machine translation can create the translation. Text-to-voice can create the dubbed audio. Voice printing and morphing can manipulate the voice qualities. Media OCR technologies can detect text on screen and turn it into text. A fully automated process would look (happily) something like this.
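Chained together, that fully automated process is just a pipeline of stages. In the sketch below, every stage function is a hypothetical placeholder standing in for a real engine (ASR, alignment, MT, TTS); none of the names refer to an actual product.

```python
# Hypothetical end-to-end automation chain; each function is a stand-in
# for a real engine, returning dummy values to show the data flow.

def transcribe(audio):  # voice recognition -> original-language transcript
    return "original-language transcript"

def align(transcript, audio):  # text alignment -> timed-text template
    return [("00:00:12.000", "00:00:14.500", transcript)]

def machine_translate(template, lang):  # MT applied per timed event
    return [(t_in, t_out, f"[{lang}] {text}") for t_in, t_out, text in template]

def synthesize(translated):  # text-to-voice -> dubbed audio track
    return b"dubbed-audio-bytes"

def localize(audio, lang):
    """Run the happy path: transcribe, align, translate, synthesize."""
    template = align(transcribe(audio), audio)
    translated = machine_translate(template, lang)
    return translated, synthesize(translated)

subs, dub = localize(b"mixed-audio", "fr")
```

The happy path assumes every stage succeeds; as the article goes on to argue, real premium content breaks that assumption at nearly every step.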
Advancements over the years in each of these technologies have resulted in some claiming 97 to 99 percent accuracy in optimal use cases. In particular, Google's recent development of deep neural network AI promises a significant leap forward for machine translation accuracy, which has been stewing in a phrase-based approach for years.
There have also been some remarkable advancements in synthesizing the human voice, producing from text a range of vocal qualities, from gender change to natural inflection, coupled with an extended range of languages. This technology has already found value in audio description for the blind. There will always be a quality consideration, as the companies that engage in this service have done a great job setting the bar. However, the cost and timeframe for producing the content, coupled with government regulations mandating it, will see this technology take greater hold.
Voice printing and voice morphing offer intriguing possibilities to reduce the number of actors necessary to cast a production, or match the sound of the original actor’s voice, but the rights to do so are murky. Soon, an actor’s voice print, based on a robust sample size, will be a quantifiable asset. This technology will find its host content.
Why aren’t these technologies widely used for movies and series today?
Every title that comes in the door for localization is a one-off. The consumer experience is guided by a tradition-based expectation of quality that is often unique to the territory. Localization is there to facilitate linguistic understanding without changing the artistic intent. Any mistake in translation, timing or audio performance mars the consumer experience. Content owners have a close connection with the consumers and, with few exceptions, quality control standards are strictly adhered to.
The content itself creates further challenges. Fully mixed audio with music and effects makes it difficult to discern the human voice for text and timing. Context, juxtaposed imagery, slang, humor, rhymes, lyrics and euphemisms make machine translation drop significantly below 97 percent, particularly when timing is a constraint, both for reading speed and for visual obstacles in the image. Lip-sync dubbing is an outlier in that it combines the complication of translation accuracy with matching an actor's performance, such as the movement of the lips (labials).
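Reading speed, at least, is a measurable constraint. A minimal characters-per-second check might look like the following; the ~17 CPS ceiling is a commonly cited rule of thumb assumed here for illustration, not a universal standard.

```python
# Subtitle reading-speed check. MAX_CPS is an assumed rule-of-thumb ceiling;
# real style guides vary by territory and audience.
MAX_CPS = 17.0

def reading_speed(text: str, start: float, end: float) -> float:
    """Characters per second for one subtitle event."""
    duration = end - start
    return len(text) / duration if duration > 0 else float("inf")

def fits(text: str, start: float, end: float) -> bool:
    """True if the translation can be read in the time available."""
    return reading_speed(text, start, end) <= MAX_CPS

# 38 characters over 2.4 seconds is about 15.8 CPS, under the ceiling.
print(fits("You wouldn't believe me if I told you.", 14.8, 17.2))  # prints True
```

This is why a too-long but accurate machine translation still fails: the event must be condensed, not merely translated, and condensation is an editorial judgment.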
The success of these technologies, in combination with the uniqueness of what is in the content container, tends to follow the principle of garbage in, garbage out. Without a certain level of consistency and predictability, an assisted, yet still somewhat manual, quality control pass is inevitable. So, in comes the human; out goes the happy path of automation.
Surely, even 50 percent accuracy in automation is 50 percent less effort, right?
At the risk of oversimplifying the value each of these technologies provides for a given piece of content, our experience has been that slogging through 100 percent of the content to fix 50 percent of the errors is often as much, and in some cases more, work than doing it from scratch. (It takes twice the effort to delete and type as it does to just type.) Humans bring their pride and experience to the task, and having experienced translators "fix" a machine translation, adjust timing, fill in the missing bits, and anguish over consistency is far less rewarding than creating it whole.
A very different kind of resource, with a specific skillset and training method, is required. Adding to this is the processing time to engage each of these technologies in the workflow, across languages and regions and resources, while weeding out the garbage. Doing all this without creating efficiency debt is difficult. But there’s hope.
What is the path forward?
For short-form, consumable, or live content, where quality expectations are more forgiving, these technologies have already taken hold, and will continue to do so. For premium content (movies and series), we need a much greater level of finesse to help with the decision making. We need to insert artificial intelligence (AI) into the gaps to manage exceptions, and to analyze or pre-qualify the contents of the container to predict the outcome before engaging the technology. It is a smart workflow based on the analysis of the data collected for each one-off piece of content. The workflow would include any combination of automated and manual tasks.
Looks complicated? Yes, it does. But isn’t automation supposed to reduce complication?
Think of a circuit board: very complicated to look at, but when it is out of sight doing its job, no one thinks about it. For this scenario to work, the technologies need to be integrated with the workspaces, which need to interact, suggest and assist the user. Project management and orchestration systems are in themselves automation facilitators as they manage the I/O, tracking and exception handling to avoid the human work of managing a process. Content classification is fairly straightforward, tagging the content based on client, type, genre, service level agreement, etc.
Workflow orchestration is an established technology for moving content through the workflow based on business rules. The hard bit is the artificial intelligence task of content analysis. AI answers what the nature of the content in the container is, and what to do with it, based on its level of confidence that the automation technologies will output something useful. In sum, to successfully deploy these technologies on content that is complex and individually unique, one must seamlessly orchestrate a combination of human and automated tasks in a highly integrated system.
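The routing decision at the heart of that orchestration can be sketched as a confidence gate: pre-qualification produces a fit score per automation technology, and each task goes to the automated path only above a threshold. The task names and threshold values below are purely illustrative assumptions.

```python
# Confidence-gated routing sketch. Thresholds are illustrative; in practice
# they would be tuned per client, genre and service-level agreement.
THRESHOLDS = {"transcription": 0.90, "translation": 0.95, "tts": 0.85}

def route(task: str, fit_score: float) -> str:
    """Return 'automated' or 'human' for one workflow task.

    Unknown tasks default to a threshold of 1.0, i.e. always human.
    """
    return "automated" if fit_score >= THRESHOLDS.get(task, 1.0) else "human"

# Clean single-speaker audio may qualify for automated transcription,
# while dense, slang-heavy dialog falls back to a human translator.
plan = {
    "transcription": route("transcription", 0.93),  # automated
    "translation": route("translation", 0.71),      # human
}
```

The design choice worth noting is the default: when the analyzer cannot classify a task, the safe fallback is the human path, preserving the quality bar the article describes.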
Will it be possible in the future for a franchise movie or series to be subtitled and dubbed in 40 languages without human intervention?
Yes, but you can bet it will not be to the satisfaction of everyone. Admittedly, to illustrate a point, I have over-simplified much about the success and failure of these technologies as they relate to localization services. Their adoption will be an evolution, based on the advancement of the individual technologies themselves and the market forces that are sure to require that content get localized faster and less expensively. Any adoption will be relative to the quality expectations and the content in the container. Quality itself is defined by the consumer, and there is an assumption that a future generation may have a very different idea about what premium entertainment content is, and how they want to experience it.