
Dev note: turning any text into something you can hear, read, and shadow — the story behind YomiPlay's Text-to-Speech

Got the text but no audio? YomiPlay's new Text-to-Speech turns any passage into natural narration plus synced subtitles — generated on your phone in seconds, ready for the same close-listening, shadowing, and review as any imported audio.

YomiPlay · Japanese Learning · Text-to-Speech

A gap that never got filled

YomiPlay's original purpose was clear: turn "what you hear" into "what you understand." You import a podcast, a recorded lesson, a video, and the app — right on your phone — turns it into timestamped subtitles with kana, romaji, and translation. Sound that used to flash by becomes material you can close-listen to sentence by sentence, shadow, and review.

But the more we used it, the more we kept hearing the same feedback — and kept hitting the same wall in our own Japanese study:

"I only have the text, no audio. Now what?"

An example sentence from a textbook, a dialogue your teacher sent over, a vocab list you put together, a self-introduction you want to memorize, a really natural expression you saw online… they're all just text. You want to know what it sounds like, read along with it, and hear it over and over, but there's no recording to go with it.

Until now, your only options were to hunt around online for audio (usually not the exact sentence), put up with the robotic built-in text-to-speech, or screenshot the text and ask your teacher. Every one of them is awkward, and every one of them breaks your study rhythm.

That gap is what Text-to-Speech is here to fill.

What it does, in one sentence

Paste any text, and YomiPlay turns it into natural narration plus synced subtitles — right on your phone.

And then? Then it's exactly the same as any audio you've imported —

  • automatically split into sentences, each with its own timestamp
  • Japanese gets kana and romaji added automatically
  • close-listen sentence by sentence, loop a single line, slow the pace down to shadow
  • save it to your library, group it, search it, come back to review anytime
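
To make the first bullet concrete, here is a minimal Python sketch of splitting pasted text into sentences and giving each one a start and end time. It is an illustration only, not YomiPlay's actual code: the names `Cue`, `split_sentences`, `build_cues`, and `SECONDS_PER_CHAR` are hypothetical, and the per-sentence duration is faked from character count. A real pipeline would presumably take the duration from the synthesized audio for each sentence.

```python
import re
from dataclasses import dataclass

# Hypothetical speaking rate, used only to fake clip lengths in this sketch.
SECONDS_PER_CHAR = 0.15


@dataclass
class Cue:
    """One subtitle line with its time range in seconds."""
    start: float
    end: float
    text: str


def split_sentences(text: str) -> list[str]:
    """Split text after Japanese or Latin sentence-ending punctuation."""
    # Each match is a run of non-terminator characters plus any terminators
    # that follow it, so "Wait..." stays together as one sentence.
    pieces = re.findall(r"[^。！？.!?]+[。！？.!?]*", text)
    return [p.strip() for p in pieces if p.strip()]


def build_cues(text: str, duration_of=lambda s: len(s) * SECONDS_PER_CHAR) -> list[Cue]:
    """Lay sentences end to end on a timeline, one cue per sentence."""
    cues = []
    clock = 0.0
    for sentence in split_sentences(text):
        length = duration_of(sentence)
        cues.append(Cue(start=clock, end=clock + length, text=sentence))
        clock += length
    return cues


if __name__ == "__main__":
    for cue in build_cues("こんにちは。元気ですか？"):
        print(f"{cue.start:.2f} -> {cue.end:.2f}  {cue.text}")
```

Passing `duration_of` as a parameter keeps the timeline logic separate from however the audio length is measured, so the fake estimate can be swapped for real clip durations without touching the rest.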

In other words: it used to be "first you need sound, then you have study material"; now it's "as long as you have text, you can make study material." Any sentence you want to learn becomes something you can hear, read, and shadow within seconds.

What it feels like to use

  1. Tap "Text-to-Speech" on the import page
  2. Paste your text in
  3. Pick a language and a voice you like (there are several male and female voices, and you can preview each one before generating)
  4. Tap generate

A few seconds later it's sitting quietly in your library, and you open it up to close-listen and shadow just like always.

The voice isn't that "robot reading a textbook" tone. Japanese, English, Korean, and Vietnamese use on-device AI synthesis that sounds natural, with a connected flow of speech; Chinese uses the iOS system voice. The whole thing happens locally on your phone, with no network and no upload, so whatever text you paste never leaves your device.
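
The language split described above can be pictured as a small lookup table. The Python sketch below is hypothetical and only mirrors what this note states (four languages on the on-device AI voice, Chinese on the iOS system voice); `ENGINE_BY_LANGUAGE`, `pick_engine`, and the language-tag handling are illustrative assumptions, not the app's real identifiers.

```python
ON_DEVICE_AI = "on-device AI synthesis"
SYSTEM_VOICE = "iOS system voice"

# Mapping taken from the description in this note; the keys are assumed
# to be primary language subtags such as "ja" or "zh".
ENGINE_BY_LANGUAGE = {
    "ja": ON_DEVICE_AI,
    "en": ON_DEVICE_AI,
    "ko": ON_DEVICE_AI,
    "vi": ON_DEVICE_AI,
    "zh": SYSTEM_VOICE,
}


def pick_engine(language_tag: str) -> str:
    """Return the engine for a tag like 'ja', 'en-US', or 'zh-Hans'."""
    # Only the primary subtag decides the engine in this sketch.
    primary = language_tag.split("-")[0].lower()
    if primary not in ENGINE_BY_LANGUAGE:
        raise ValueError(f"unsupported language: {language_tag}")
    return ENGINE_BY_LANGUAGE[primary]


if __name__ == "__main__":
    for tag in ("ja", "en-US", "zh-Hans"):
        print(tag, "->", pick_engine(tag))
```

Both paths in this sketch are local to the phone, which matches the point above that pasted text never leaves the device.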

A few scenarios you'll actually use

  • Textbook sentences into listening practice: paste in the few example sentences you learned today, generate a clip, and listen and softly shadow on your commute.
  • Your own speaking practice: write a self-introduction or a situational dialogue, generate the audio, and correct your pronunciation against a standard reading.
  • Collecting bits of expression: come across a natural phrase, paste it in on the spot, and build up your own "natural-expression listening library."
  • Word / phrase memorization: turn a vocab list into narration, listen with your eyes closed, then write down what you hear. It sticks better than rote memorizing.
  • Reading material with no audio: want an article read aloud to you? Turn it into shadowable audio; combining listening and reading helps it stick.

One small thing we care about

We didn't make it a "read-it-once-and-done" reader. The results fold completely into YomiPlay's learning loop — the same playback, furigana, shadowing, and review system as the podcasts and videos you import. Because for a learner, "hearing it" is only the first step; being able to shadow it repeatedly and come back to review is what actually makes it stick.

So from now on, "I only have text, no audio" is no longer an excuse. The sentence you want to learn — just let it read itself to you.


Open YomiPlay → Import → Text-to-Speech, and try a sentence you want to learn today.