Collage of colorful AI images generated for WaniKani

AI Visuals for Japanese Mnemonics

Andrew Gentry

August 26th 2024

Introduction

This article turned out longer than I anticipated. If you take the time to read it, you’ll find fragments of my colorful career and my general attitudes towards building things. If I’ve managed to do the writing part of this whole project alright, hopefully I can convey a mindset that (in some ways at least) aligns with the philosophies of Van Neistat and Matt Crawford's "Spirited Man."

The intention of this article is to share my examination and execution of a minor product improvement to a website that I’ve been having a lot of fun with lately, WaniKani. I wanted to finally write about the ways I find myself balancing technical possibility with realistic outcomes. It’s a chance for me to illustrate the way I work, in my career and with side projects too. In the sections that follow, I’ll talk about some AI experiments, cloud infrastructure, Python scripting, JavaScript fun, and even a little math. Beyond just the technical breakdown, however, I've also found some habits for thinking through product validation, design, and business value, and I felt it important to talk about those parts of my process too.

WTF is WaniKani?

For some time, I’ve wanted to travel to Japan. More specifically, I’ve wanted to ride a bike across Japan. For almost 2000 miles, I’d like to witness the changing landscapes of mountainous onsen towns, coastal fishing villages, sprawling urban cityscapes and do it all supported by my own power and willingness to travel on a bike I built by hand.

When this idea first took shape, I decided to try learning some kanji—the Japanese writing system. I never expected to become fluent in Japanese, but I thought recognizing common characters might help me navigate a foreign land on my bike. (What’s the kanji for "bike shop"? "Water"? "No bikes allowed"?) This is where WaniKani comes in.

WaniKani is a Spaced Repetition System (SRS) for learning Japanese kanji, radicals, and vocabulary. It’s created by a group called Tofugu and is easily one of the web applications I’ve enjoyed most as an adult. WaniKani teaches kanji by first introducing “radicals,” the building blocks of the characters, and then moving on to “kanji,” the words these characters typically represent. Each kanji usually corresponds to a single word and has a “reading,” which is how it’s pronounced.

For example, the kanji for “mountain” looks like 山 and is pronounced “san.” WaniKani provides this mnemonic to help you remember that.

Mnemonic:

Think about mountains talking to each other, calling each other by their names and adding the Japanese name-ender san (さん) to each of their names. "Hello, Everest-san." "Oh hi, Fuji-san."

My Niche Problem

Despite my praises for the WaniKani system, often the mnemonics that WaniKani provides to accompany the readings fall short for me. Sometimes, the mnemonic is just not memorable. Occasionally, they use a cultural reference I don’t know. This hasn’t been detrimental, since I can easily turn to AI for a good mnemonic. The more I leveraged ChatGPT for mnemonic generation, the more curious I grew about how others have used AI for their WaniKani experiences. What grabbed my interest most were some projects like this one that took the AI mnemonic idea a step further and generated images to go along with the Kanji.

The existing projects I found, however, all seemed to have fallen short in a few ways.

A Product Management Aside.

This project, like most things I build, began with play—experimenting with the latest AI tools without a clear agenda. I started by asking myself, "Can this really be done... and done well?" This exploratory phase is an important part of engineering, and often one of the most fun. The point is to become familiar with the tools' features, possibilities, and capabilities before jumping into the design process. Not only does this help establish a baseline for technical feasibility, but it also opens up space for ideation, often leading to the discovery of additional features worth building.

Channeling that excitement at the start of a project has always been a challenge for me. Ideas quickly spiral, especially when you're an engineer predisposed to action and execution. After all, building cool stuff is what I’m good at, right? A thought strikes, you see an elegant technical solution, and off you go!

But in that rush to build, some questions get left in the dust: Who is this actually for? Does it need to make money? How long should it take? Will anyone use it?

You don’t necessarily need to answer all—or any—of these questions before starting something fun. It depends on how much product management you apply to what you're building. In my case, this project is an experiment, so profitability and timelines aren't concerns. If you’re working for a company owned by private equity, short-term profitability might be the only thing that matters. In a startup or client-facing role, the focus might shift to landing business or capturing users. The point is, there’s a sliding scale of restraint to apply to ideas before spinning up infrastructure and sinking in engineering time. Establishing what that scale is before you start can really pay off.

For this project, applying some basic principles of product management helped me avoid building unnecessary features. What could have been an afternoon project lost to a dusty GitHub repo turned into a legitimate tool. By considering the problem I wanted to solve and validating gaps in existing solutions, I was able to stay on track and finish the project with the following outline:

  1. Explore AI tools and functionality for generating images from mnemonics.
  2. Validate that AI images are high-quality and worth integrating with a WaniKani workflow.
  3. Validate that I can create thousands of these images somehow.
  4. Validate that I can create a solution with minimal cloud infra and minimal software development.
  5. Build an MVP (Minimum Viable Product).
  6. Interact with the WaniKani community to understand the highest-priority second round of features.
  7. Build a feedback system.

Let’s Play with AI

Custom GPTs

I was already familiar with OpenAI’s Custom GPTs, having used them in a few experiments at StickerGiant. These AI assistants come with pre-programmed agendas that help make prompting more focused. The functionality offered by Custom GPTs seemed like a good starting point for getting an AI to focus on the specific task of generating mnemonic images. It only took a few iterations before my AI started producing reliable, high-quality images.

My “instructions” to the GPT followed something like this:

This GPT creates memorable images that help users remember mnemonics that are associated with Kanji. The user will provide input in this format:

name: (kanji meaning)
mnemonic: (the descriptive mnemonic)

The GPT will then take both the name and the mnemonic together to generate an image that depicts the mnemonic scene and if possible ties in the name somehow as well. If possible the image should have the Name written somewhere in the image. The GPT should produce an image that incorporates the aspects of the mnemonic and the name into something that is produced in a random style.

Output Goal: An image that depicts the scene.

The results were immediately excellent. As if the mental images I had pictured in the past while learning these Kanji had been dragged right from my mind onto the screen. Okay maybe that’s dramatic - but the scenes it produced really exceeded my expectations.

For example, the reading for the word “Two” in Japanese is pronounced ”Ni”:

Example chat with AI to produce a mnemonic image for the word Knee

Not only did the GPT produce an image that depicted the mnemonic, but it also smartly added the word “Two” into the image in an (almost) subtle way. The fact that DALL-E is this good at getting text into images is a major improvement over the generative AIs I’ve used previously.

OpenAI Assistants

Despite the promising results from the Custom GPT, I wanted a way to produce these images systematically rather than through a text prompt in my browser. If I want to produce all 2,000 kanji images, I need a tool that I can script against. As far as I can tell, the Custom GPT feature isn’t accessible via API. To interface with the AI programmatically, the Assistants API is more suitable. As of writing, OpenAI provides both Python and Node SDKs to program against.
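For illustration, here is roughly what standing up such an assistant looks like with the Python SDK. This is only a sketch: the assistant name, model, and instructions are placeholders loosely based on my Custom GPT instructions above, and the beta API surface may change.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Create an assistant that plays the same role as the Custom GPT.
# (Sketch only: name, model, and instructions are placeholders.)
assistant = client.beta.assistants.create(
    name="Kanji Mnemonic Prompter",
    model="gpt-4o",
    instructions=(
        "You write vivid, specific DALL-E prompts that depict the scene "
        "described by a WaniKani reading mnemonic for a given kanji meaning."
    ),
)

print(assistant.id)  # stored as OPENAI_ASSISTANT_ID for the generation script later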

Sadly, compared to the Custom GPT I built earlier, the Assistants feature seems to have more limited capabilities. While the Custom GPT was smart enough to understand its role and leverage DALL-E in the background to generate images, the API-based Assistants have specific inputs and outputs. The “Chat” function of the Assistant could produce text and even suggest prompts that could be handed off to DALL-E. But to actually generate an image as a response to an API request, I needed to use the dedicated DALL-E SDK.

Feeding mnemonic prompts right into the DALL-E tool did not yield great results either, since the raw text of a mnemonic doesn’t exactly lend itself to a specific picture. This is okay, though, since I could just script the output of the Chat Assistant directly into DALL-E and get (hopefully) similar results.

AI image generation pipeline diagram

This is what I had assumed the Custom GPT was doing under the hood. So, chaining these tools together felt promising to me. However, after running some tests, I started to question that assumption.

The same mnemonic for “Two” (Pronounced: "Ni")

Poor AI mnemonic image produced after first attempt

Not exactly memorable, or relevant. Nothing in this image seems to have anything to do with “Knees” or even the word “Two”. This is one example, but overall, the ability to depict a clear scene similar to the mnemonic seemed to have broken down with my manual chain between the two tools. What could be going wrong?

A brief excursion into Prompt Engineering

It was at this point that I felt out of my depth. Images were definitely being generated, and they seemed to (sometimes) have some loose connection to the mnemonics that came in as input, but I was not very excited by them. I found myself feeling as though I was working against the AI instead of with it to produce valuable results. Is this what the kids call “Prompt Engineering”? I reckoned I’d find out by reading some articles on the subject and coming back with some tactics to apply.

Step-By-Step Thinking

Asking the system to work through its process step by step leads to better results. My case seems like a good candidate: the pipeline has several intermediary steps, making leaps from one task to another (meaning, to mnemonic, to scene description, to image) before producing an end result.

Specificity around Image Composition

The most influential change I made to my prompts was adding instructions that emphasized describing character placement and scene mechanics. Telling the Chat Assistant to produce prompts that focused on what people were doing, and how they were physically oriented while doing it, led to noticeably better images in the end.

Bad Prompt

Imagine a whimsical scene of cartoonish characters counting each other's knees with an expression first of confusion, but then they eventually change their expression to loving certainty.

Good Prompt

Two strangers face each other in the street. One stranger presents his knees towards the other while holding two fingers up. The scene is set on a sidewalk in the city in the daytime.

Prefer Positive Over Negative Prompts

I also encountered some challenges with DALL-E producing images that contained many nonsensical kanji characters. When the prompts included information about kanji, DALL-E would produce kanji-looking symbols as part of the image. I tried solving this first by including instructions to “Never produce Kanji characters.” What turned out to be more productive, however, was engineering prompts to positively reinforce people, animals, and scenery instead of written phrases.

Walk Away

One approach that’s often overlooked in “prompt engineering” guides is the tried-and-true practice of putting some distance between yourself and the problem. Many programmers are familiar with the experience of taking a break when stuck on a problem. If you're interested in sleep science, you might know that sleep helps us perform and learn better by reducing the noise of other stimuli accumulated throughout the day. My take is that a similar thing happens when you walk away for a bit. I’ve noticed that my own prompts to the AI can become stale and rigid as I tweak minor details in search of better results. If you find yourself hitting a wall, my suggestion is to walk away—listen to a podcast about sleep or just take some time away from the problem. Usually, you’ll discover a new voice, a fresh approach, or a different way of crafting your prompts that leads to better results.

File Search Tool - Prepared Vector Data

There was another tool that played a crucial role in generating quality images through my AI experiments—the file_search functionality for assistants, which is also still in beta. As I understand it, this feature enhances the assistant by allowing it to access and query pre-provided data efficiently. Files or documents uploaded in this way are processed to enable quick searches and queries by the assistant.

I used this feature to teach the AI about all the mnemonics and mnemonic hints in advance. In theory, this should allow the chat assistant to quickly look up a mnemonic based on a single word I provide, rather than having to query both the word and its mnemonic together each time.
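A rough sketch of what that setup can look like with the Python SDK, assuming the beta vector store endpoints available at the time of writing (the filename and store name are placeholders for the pre-parsed mnemonic JSON described in a later section, and the assistant ID is whatever the earlier creation step returned):

from openai import OpenAI

client = OpenAI()

# Create a vector store and upload the pre-parsed mnemonic JSON into it.
vector_store = client.beta.vector_stores.create(name="wanikani-mnemonics")
with open("kanji_mnemonics.json", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=[f]
    )

# Attach the store to the assistant and enable the file_search tool.
client.beta.assistants.update(
    assistant_id="<assistant-id>",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)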

Idea: Consistent Recurring Story Characters

As you progress through the WaniKani levels, you'll encounter a few recurring characters in the mnemonics. This is because many words share identical readings, and there’s often a common association between certain readings and specific subjects or words.

For example, WaniKani uses “Ms. Chou,” a troublesome older lady who haunts the streets, as a recurring character. “Koichi,” based on one of the WaniKani creators, also appears frequently and is depicted consistently across many levels.

It would be interesting to experiment with uploading pre-defined descriptions of these characters and instructing the assistant to depict them consistently. The goal would be for each character to maintain a similar appearance across the various images generated, creating a cohesive visual narrative as you advance through the levels.

Finally, Some Good F***ing AI

After identifying the right tools for the job, optimizing my prompting tactics, and leveraging some advanced new features, my results began to resemble the first exciting experiments I tried with Custom GPTs. Now when I fed in the mnemonic for the Japanese reading of “Two,” the AI started producing results like this:

Better AI mnemonic image produced with prompting tweaks

This checks the validation box for me. This image clearly illustrates the mnemonic from WaniKani, and it is the result of tools that I can access programmatically. I’m happy with the style too (even if the man in the picture above has two left hands).

Scripting AI to generate lots of images

As mentioned earlier, the idea of generating AI images for WaniKani mnemonics isn’t new. One challenge I've noticed with similar projects is the approach to creating a complete collection of images. The WaniKani system includes over 2,000 kanji alone. If I eventually want to tackle vocabulary and radicals, that number grows to well over 10,000 images just for readings. I want to avoid the time sink of manually generating each image. I’d also like to maintain a level of human QA on what the AI produces (more on the QA aspect later). If I can build confidence in the image generation pipeline, I should be able to systematically crawl through every kanji and automate the image creation.

To begin testing out systematic image generation, I wrote a simple Python script following this pseudocode:

1. import OpenAI libraries
2. iterate through list of 2000 kanji
3. For each kanji, ask the Chat Assistant for a really good DALL-E prompt
4. Take the DALL-E Prompt to DALL-E and ask for an image
5. Download the produced image to my desktop

Quick Math

The images that DALL-E produces usually clock in at a whopping 2 MB.

2000 × 2 MB = 4,000 MB ≈ 4 GB of images to download.

That’s manageable on my 1 TB machine, but it grows quickly if I ever extend this to the 10,000+ vocabulary and radical subjects, so it might still make sense to put the images through some kind of downsizing pipeline, send them to an external drive, or run the process in batches.
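If I go the downsizing route, a Pillow pass over the output folder would probably be enough. A minimal sketch (the output folder, target size, and JPEG conversion are my own assumptions):

from pathlib import Path
from PIL import Image

SRC = Path("temp/images")        # full-size PNGs from DALL-E
DST = Path("temp/images_small")  # hypothetical downsized output folder
DST.mkdir(parents=True, exist_ok=True)

for png in SRC.glob("*.png"):
    with Image.open(png) as img:
        img.thumbnail((512, 512))  # shrink in place, keeping aspect ratio
        out = DST / png.with_suffix(".jpg").name
        img.convert("RGB").save(out, "JPEG", quality=85)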

Collecting WaniKani Data

Of course, to iterate over the entire WaniKani kanji collection, I need to retrieve that list somehow. Luckily, the WaniKani API lets us do this with a simple (paginated) REST call.

Since I plan to focus on Kanji for the MVP, I’ll specify kanji as the only subject type we’d like from WaniKani with this GET call:

curl --location 'https://api.wanikani.com/v2/subjects?types=kanji' \
--header 'Wanikani-Revision: 20170710' \
--header 'Authorization: Bearer <api-token>'


The response is paginated and includes the first 1,000 kanji. With this data, I do some manual pre-parsing to make the file a bit more manageable. If I find some more time, it may be valuable to automate this step as well (see the sketch after the word list below). I first run the file through this JSON Formatting Tool.

What is great about this tool is that it supports a query language called JMESPath that lets me pare down the JSON response from WaniKani into fields I care about.

The query I used in particular is as follows:

data[].data.{Meaning: meanings[0].meaning, Mnemonic: reading_mnemonic, Hint: reading_hint}


This produces a JSON file with the following format:

[
  {
    "Meaning": "One",
    "Mnemonic": "As you're sitting there next to One, holding him up, you start feeling [...]",
    "Hint": "Make sure you feel the ridiculously itchy sensation covering your body. It [...]"
  },
  {
    "Meaning": "Two",
    "Mnemonic": "How do you count to two? Just use someone's knee (に), and then their [...]",
    "Hint": "Imagine asking to borrow the two knees of a stranger in the street, just [...]"
  },
  ...
]


This is the JSON file that we pre-populate the assistant with. By saving these in the Assistant’s vector store, our script will only need to iterate over the kanji words themselves, and the assistant will already have context on which mnemonic goes with that word.

The JMESPath query to generate the Kanji word list is simply:

data[].data.meanings[0].meaning


Which produces a clean array like this:

[
  "Stone",
  "Ten Thousand",
  "Now",
  "Origin",
  "Inside",
  "Part",
  "Cut",
  "Noon",
  "Friend",
  ...
]
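For reference, here is a rough sketch of what automating that pre-parsing could look like, using the requests and jmespath libraries and following the pages.next_url pagination field in the WaniKani response. The environment variable and output filenames are my own placeholders, and the queries are the ones from above minus the outer data[] prefix, since the pages’ data arrays are already collected into one list.

import json
import os

import jmespath
import requests

HEADERS = {
    "Wanikani-Revision": "20170710",
    "Authorization": f"Bearer {os.environ['WANIKANI_API_TOKEN']}",  # placeholder env var
}

# Walk every page of kanji subjects via the pages.next_url field.
subjects = []
url = "https://api.wanikani.com/v2/subjects?types=kanji"
while url:
    page = requests.get(url, headers=HEADERS).json()
    subjects.extend(page["data"])
    url = page["pages"]["next_url"]  # None on the last page

# Apply the same JMESPath ideas used above.
mnemonics = jmespath.search(
    "[].data.{Meaning: meanings[0].meaning, Mnemonic: reading_mnemonic, Hint: reading_hint}",
    subjects,
)
kanji_words = jmespath.search("[].data.meanings[0].meaning", subjects)

with open("kanji_mnemonics.json", "w") as f:
    json.dump(mnemonics, f, ensure_ascii=False, indent=2)
with open("kanji_list.json", "w") as f:
    json.dump(kanji_words, f, ensure_ascii=False, indent=2)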

Putting It All Together

After all the coding montages were complete, my final Python script looked something like this:

from openai import OpenAI
import requests
import csv   
import os
import time
import kanji_list
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def generate_prompt(client, kanji):
    try:
        print(f">>>>> Generating Prompt for {kanji}")
        # Create a new conversation thread for this kanji
        thread = client.beta.threads.create()
        
        # Add a message to the thread with the Kanji to generate
        client.beta.threads.messages.create(
            thread_id=thread.id,
            role="user",
            content=kanji
        )

        # Run the thread    
        run = client.beta.threads.runs.create_and_poll(
            thread_id=thread.id,
            assistant_id=os.environ.get("OPENAI_ASSISTANT_ID"),
        )

        # If the run completed, pull the generated prompt from the thread
        if run.status == 'completed': 
            messages = client.beta.threads.messages.list(
                thread_id=thread.id
            )
            prompt_text = messages.data[0].content[0].text.value
            
            return prompt_text
        
        else:
            raise Exception("Prompt Generation Failed - Status Incomplete")

    except Exception as e:
        print(f"Error generating prompt. {e}")
        raise

def generate_image(client, kanji, dalle_prompt):
    try:
        print(f">>>>> Generating Image for {kanji}\n")
        image_response = client.images.generate(
            model="dall-e-3",
            prompt=dalle_prompt,
            size="1024x1024",
            quality="standard",
            n=1,
        )

        image_url = image_response.data[0].url

        img_data = requests.get(image_url).content
        with open(f'temp/images/{kanji}.png', 'wb') as handler:
            handler.write(img_data)

        return image_url

    except Exception as e:
        raise Exception(f"Error generating image. {e}")

def log_kanji_to_csv(kanji, prompt, image_url):
    fields=[kanji, prompt, image_url]
    with open('temp/kanji_log.csv', 'a') as f:
        writer = csv.writer(f)
        writer.writerow(fields)

def log_kanji_problem(kanji, e):
    fields=[kanji, e]
    print(f'{e}')
    with open('temp/kanji_errors.csv', 'a') as f:
        writer = csv.writer(f)
        writer.writerow(fields)

# Main execution
error_count = 0

# Track total run time
start = time.time()

if not os.path.exists('temp'):
    os.makedirs('temp')
    os.makedirs('temp/images')

for kanji in kanji_list.kanji:
    if error_count > 10: 
        break

    try:
        prompt = generate_prompt(openai_client, kanji)
        image_url = generate_image(openai_client, kanji, prompt)
        log_kanji_to_csv(kanji, prompt, image_url)

    except Exception as e:
        log_kanji_problem(kanji, e)
        error_count = error_count + 1

end = time.time()
length = end - start
print(f'Total Run Time: {length}')

Scoring Test Runs

Remember when I mentioned QA earlier? I realized that I wouldn’t have the time or bandwidth to review all 2,000 kanji images and compare them to their mnemonics. Instead, I opted for a sampling approach, focusing on the first few levels of WaniKani. If the first 50 images produced by my pipeline were (subjectively) satisfactory at least 80% of the time, I could trust the process. While there’s more to be said about building a voting or feedback system later on, that’s out of scope for now. My immediate goal was to generate a test run where most of the images were good enough.

There are about 50 kanji in the first two levels of WaniKani, so I applied this sampling method to those levels. I ran the script and watched as the images appeared, laughing quite a bit along the way.

To evaluate the images, I assigned a subjective score between 1 and 10 to each one. Images that couldn’t be produced due to content moderation issues or other errors received a score of 0. I rated images higher if they clearly depicted the mnemonic, felt memorable, and even higher if they incorporated references to the actual word being depicted—whether through clever scenery, literal text, or other creative elements like speech bubbles, signage, or name tags.

My first run used an earlier version of my assistant, before I applied the prompt engineering tactics described above.

Bar graph of image scores from first batch of AI images
Average: 6.151
Above 6: 49.05%
Above 7: 33.96%

I then did a run on the same two WaniKani levels with my prompting improvements applied.

Bar graph showing scores of second batch of images
Average: 7.321
Above 6: 81.1%
Above 7: 58.49%

At this point, more than half of the images produced score an 8 or higher. While I would love to optimize this even further, I took it as a clear signal not to get too hung up on the prompting and to move on to the outstanding software development.

This testing process, though subjective, was a great way to apply some kind of quantitative metric to my results. Scoring 50 images helps dampen my personal bias and reveals a trend in the overall quality of the pipeline. Plus, if I want to go back and improve the image generation procedure or tweak the prompting again, I can run the same test on the new results to validate that there hasn’t been a regression.
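That regression check is easy to script, too. A minimal sketch, assuming a hypothetical temp/scores.csv with one "kanji,score" row per image:

import csv

# Hypothetical file: one row per image, e.g. "Two,8"
with open("temp/scores.csv") as f:
    scores = [int(row[1]) for row in csv.reader(f)]

total = len(scores)
print(f"Average:  {sum(scores) / total:.3f}")
print(f"Above 6:  {sum(s > 6 for s in scores) / total:.2%}")
print(f"Above 7:  {sum(s > 7 for s in scores) / total:.2%}")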

Architecture - An exercise in paring down.

Having established that image generation was not only possible, but consistent enough in a bulk setting, I then turned my attention to the system for displaying these visual mnemonic images alongside the WaniKani experience. I know that placing the images inline with WaniKani lessons or even just through my own public page will require hosting the images. To tease out what types of other infrastructure I may want to stand up as well, I moved on to diagramming some relationships. This process is helpful for identifying over-complexities, and for getting a dose of reality around what kinds of maintenance and engineering I’m signing myself up for. Once again, double-checking assumptions and looking at sobering questions about the commitment are always helpful in the long run.

My first infra diagram was admittedly naive, with a Client-Server-DB structure. I assumed we would need to map each kanji to its respective image and provide an API to any client that wanted a reference to the images.

Overkill.

Diagram of first attempt at cloud architecture.

After being overwhelmed by the thought of standing up servers, functions, CDNs, and databases for just the back end, I realized that the mapping between kanji and their image locations is not worth a database. In fact, a simple table could be stored on the client, eliminating the need for a DB/Server/API at all.

The second draft looked like this:

Diagram of second attempt at cloud infrastructure

Though, in practice, the “lookup” to the images could simply be an implicit relationship based on the name of the image in the bucket itself. So long as images in the bucket are consistently named and reachable at https://bucket-address/<kanji>.jpg, images can always be found by virtue of their name being a key. As I write this, it feels painfully obvious, but it took diagramming it out and starting to build the client for the thought to crystallize. That is why it’s less expensive to diagram first.

Furthermore, if I later find myself wanting more information in the application (e.g., deciding to include the original mnemonics next to the images on a webpage), we can rely on the existing WaniKani API to make those references the same way, with an implicit key to the kanji: just make another API call to WaniKani to retrieve info about <kanji>.

I did opt to include one premature optimization in the architecture. Placing a CDN in front of the image bucket would ensure fast image delivery wherever we use the images, and additionally give me the peace of mind of a WAF to prevent bad actors from running up my cloud bill.

My final infra design then looks like this:

Diagram of final cloud architecture design.

This architecture is far more digestible for a hobby project like this. Setting up an S3 bucket with a CDN is quite simple; I was able to get it running and tested in less than 30 minutes. If the infrastructure were more complex, or if I were building this for a client, I would likely stand it up using IaC. For my purposes, the AWS GUI was just fine. Engineering like this is a never-ending tightrope act of doing it right and getting it done.
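For completeness, pushing the generated images into the bucket is only a few lines of boto3. A sketch, assuming the kanji-named files from my script and a placeholder bucket name:

from pathlib import Path

import boto3

s3 = boto3.client("s3")  # credentials come from the usual AWS env/config
BUCKET = "my-wanikani-images"  # placeholder bucket name

# Upload each image with the kanji meaning as its key, matching the
# implicit <bucket-address>/<kanji> lookup described above.
for image in Path("temp/images").glob("*.png"):
    s3.upload_file(
        str(image),
        BUCKET,
        image.name,
        ExtraArgs={"ContentType": "image/png"},
    )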

Design Considerations

The effective “back end” is complete. Images can be accessed simply by asking for the kanji name, and using our AI script, we can count on “good” images being available when we want them. The next consideration is where and how we want them.

What was the goal of this tool again? Oh yeah,

Provide mnemonic images as an aid to memorization.

We want to provide the user a chance to associate these ridiculous images to the readings that are being learned in lessons. Also, because lessons only happen once, it would probably be a good idea to offer a way to revisit these images too after the fact.

WaniKani’s UI tabs through different parts of a newly learned Kanji on pages that look like this:

Example webpage from WaniKani lessons page.

The information around the kanji reading and its mnemonic is presented here. To effectively surface our images during the learning process, we need to integrate them at this stage. For post-learning reference, two options come to mind.

One option is to build a website that organizes all the WaniKani kanji into a gallery, displaying the corresponding images. This solution could include search and filter capabilities, giving us complete control over the UI and endless customization options.

Alternatively, we could leverage WaniKani's existing reference dictionary of kanji. These are sorted by level and color-coded to indicate which kanji users have and haven’t encountered yet. Even better, this feature is already built into the WaniKani site, making it more seamless for users (the idea of asking users to leave WaniKani to view the images seems cumbersome to me).

Given our goals, a browser extension that manipulates the WaniKani web pages under specific conditions seems like the most practical solution. This extension could allow users to view our AI-generated images directly on both the lessons page and within the WaniKani dictionary. Ideally, we could inject images smartly or offer options for users to view them through specific actions.

Of course, these are just my thoughts as the developer. In a business setting, the next step would typically involve user studies, focus groups, or research to validate what makes the most sense from a product perspective. However, since this project is for charity, I have the freedom to shoot from the hip.

Test Designs Using Inspect Element

One of the best uses I find for browser developer tools is the ability to manipulate the DOM and test our UI tweaks against live webpages. I spent some time doing this on both the lessons and dictionary pages, experimenting with various options that not only met the functional needs but also respected WaniKani’s style and minimal UI.

Hypothetical UI design for AI images on WaniKani dictionary page
Dictionary Page - Added a Visual section and a modestly sized image based on the kanji.

Hypothetical UI design for AI images on WaniKani lessons page.
Lessons Page - Added a visual section with a button to trigger the image shown in a modal.

Building a Chrome Extension

Browser Lock-In

I hate the idea of locking this solution into one browser. Personally I use Firefox, so even the process of developing for Chrome falls outside of my own workflow. That being said, I know enough from my work in e-commerce and web that Chrome is likely going to be the most common browser for WaniKani users. Safari might be up there too, but I figure it is enough to start with Chrome, and then investigate the difficulty of porting the extension over to other browsers later (mobile anyone?).

Reverse Engineering

Building a browser extension to manipulate someone else’s webpage is fun, but it can also be a bit of a headache. Depending on the task, detecting pages, specific content, and events on an external webpage requires a level of reverse-engineering the site we’re manipulating. For this project, I needed to detect the page a user was on (whether they were viewing a lesson or the dictionary) and then inject elements into the DOM so that the UI appeared as if WaniKani itself were serving it.

The first challenge with this approach is that WaniKani—like most modern web applications—is a Single Page Application (SPA). This means that, as far as the browser is concerned, only one page is ever loaded, while all the navigation within the site is handled by the UI framework it’s built on (e.g., React, Angular, Vue). These frameworks are fantastic for frontend development and have become standard, but they force our little JavaScript spy to be clever about figuring out what’s happening on the page.

The second problem is not so much a problem as an inherent risk with building an extension like this. Dependencies on WaniKani’s system are everywhere, from detecting page elements via CSS selectors to the mnemonics themselves. Much of this tool relies on what WaniKani looks like at this instant in time, with no assurances that it will hold up in the future. What happens if they change URLs, obscure their CSS classes, or refactor their UI completely? Short of convincing the creators at Tofugu to incorporate my images into WaniKani directly, there is little we can do about this. I mention it here as something worth recognizing and preparing for when building tools like this.

The Code

For the programmers out there - here is the Source Code

I used a combination of MutationObservers and EventListeners to handle WaniKani page detection. Aware that performance can be a concern when observing DOM elements, I made an effort to write my content script with efficiency in mind, minimizing potential memory leaks. To create the UI elements, I directly instantiated document elements and applied the appropriate styles—sometimes matching WaniKani’s design, other times using my own. If I were more concerned about WaniKani changing their class names, I might have opted to create and rely solely on my own classes, rather than assuming theirs would remain consistent. I’m certainly no expert in JavaScript, so I’m sure there’s room for improvement in other areas as well. If you’re the type to contribute, feel free to check out the repository. 🙂

Publishing

Publishing a Chrome extension is honestly a bit of a chore, primarily due to the tedious process of creating all the required icons, screenshots, descriptions, and other marketing materials that Google demands. Additionally, your extension goes through a review process to ensure it meets their standards and doesn’t engage in any malicious activity—understandable, but still an extra step. If you’re interested in learning more about the Chrome publishing process, you can visit the official guide.

What’s Next?

With great software comes a great backlog. I just made that up, but it’s true. There are so many more things that I can think of adding to this project that would take it to the next level. Here are the big three that come to mind right away:

Image Voting System

I want to push the quality and usefulness of the images as far as theoretically possible. However, as long as I’m the only one evaluating their effectiveness, I’ll never hit that goal. I think there are some lightweight ways to integrate a feedback or voting system into the extension, though. I’m imagining some sort of thumbs up/down buttons next to images when they’re viewed in full-screen mode. By tying those responses to a table that ranks the most upvoted and downvoted images, I could identify which ones need tweaking or manual prompting.

Porting to more browsers (mobile?)

I strongly dislike the fact that the nature of a Chrome extension limits this tool to a single browser. I don’t have any WaniKani analytics to tell me what the browser demographics are, but regardless of the split, it would be nice to be able to use this tool in Safari and on mobile.

AI Mnemonics in notes

One of the original problems this journey aimed to solve was instances where the WaniKani mnemonics weren’t great. I think it would be a nice complement, from an extension/AI perspective, to add AI-generated mnemonic suggestions to users’ lesson notes when they are learning kanji, providing an alternative to the WaniKani mnemonics inside the same extension.

Add Your Input!

I would love to hear from others about what you think would be the most valuable next addition to the project. Depending on the interest and what the WaniKani community thinks, I will go ahead and do another post like this one on the next feature that gets built out.

If you’ve made it this far, I would love to hear your thoughts on this breakdown. I’ve set up this nifty form to collect feedback on the process, the product, thoughts on new features and room to subscribe to any future posts like these.

Links & Resources

Source Code (Extension + Image Generation) - Here

Download the Chrome Extension - Here

Custom GPT (requires OpenAI subscription) - Here