Deep GUIs

Code / Model / Data

Introduction

As a mechanical engineer who transitioned into software, I’ve always been intrigued by how differently things break down in the physical world versus the world of bits. Friction wears things down, inertia keeps things going, and passive stability maintains balance. Software, however, is unforgiving and brittle.

It’s pretty clear that this is going to change with LLMs. Some even say that language is becoming the new UI. It’s a compelling idea, but until we all have Neuralink implants or AIs completely take over work, visual computing will always have an integral role to play in efficiently cramming piles of data into our brains.

This brings me to an intriguing aspect of generative AI that I haven’t heard much about since the release of Stable Diffusion (SD) and ChatGPT: generative GUIs. To clarify, I am not referring to LLMs creating frontend code that is then compiled into a website - a process that, while impressive, still carries the characteristic brittleness of software. Instead, I’m thinking about GUIs that are generated and customized in real-time, responding dynamically to user inputs of various kinds (not just text).

Subtle signs of this transition are already present. Several AI products are chipping away at traditional graphical workflows, integrating language as a central component:

Adobe's Firefly
Diagram's Genius within Figma

These, however, are focused on specific use cases. Also, most of the ways the user interacts with the computer, inside or outside the app, are still bound to GUIs designed by developers. Generative GUIs, on the other hand, would be entirely generated on the fly.

Below are some diagrams that clarify how this would differ from other similar approaches. The wider the block, the more applications it covers at that stack level, and the taller it is, the more it chips away at traditional software for a given application. ChatGPT would fit at the top-left corner of the diagram: it can be used for a wide variety of applications at the business layer, but it’s still limited to text outputs, so it does not cover much of the presentation layer. Also, it requires some rendering code to be written to display the text properly, hence the bridge to the display layer. Something like Midjourney would fit at the top-right corner: it covers a small set of applications (image generation and editing), but it does so end-to-end through the business and presentation layers. GUIs generated with LLMs and subsequently rendered are a bit of a mix. Generative GUIs, however, would cover most of the stack: a wide variety of applications, end-to-end.

Diagrams: Text GenAIs / Image GenAIs / Code GenAIs / GUI GenAIs

As you can see, I’m using the word GUI very loosely here. In some sense it would be more accurate to talk about generative apps, since the diagrams incorporate the business layer, but that term is too ambitious for this tiny project.

Toy example

Building something like this is a massive undertaking. Big and ambitious companies only very recently managed to get LLMs to work sufficiently well, and visual data sits at a much lower level of abstraction than text: modelling semantics at that level well enough to do something useful is much harder. So, I decided to start with a toy example: mimicking my Mac’s desktop and mouse. This was hacked together in a night (+ some training iterations), so it’s not perfect and rather simple. But I think it manages to get the point across.

I capture my mouse’s x and y coordinates, bin them to a 20x20 grid, format them as text, pass them to Stable Diffusion, and train a ControlNet to generate the correct mouse position on my desktop:
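Concretely, the binning and prompt formatting boil down to something like the sketch below. The prompt template and the use of pyautogui to read the cursor are just for illustration; check the repo for the exact details:

    # Bin raw cursor coordinates to a 20x20 grid and spell the cell out as text,
    # so the CLIP text encoder only ever sees a small, stable set of prompts.
    import pyautogui  # illustrative choice for reading the cursor position

    GRID = 20

    def bin_coords(x, y, screen_w, screen_h, grid=GRID):
        """Map pixel coordinates to integer grid cells in [0, grid - 1]."""
        col = min(int(x / screen_w * grid), grid - 1)
        row = min(int(y / screen_h * grid), grid - 1)
        return col, row

    def coords_to_prompt(col, row):
        """Illustrative prompt template for the binned position."""
        return f"mouse at column {col}, row {row}"

    screen_w, screen_h = pyautogui.size()
    x, y = pyautogui.position()
    print(coords_to_prompt(*bin_coords(x, y, screen_w, screen_h)))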

It took a couple of iterations to get it to work (binning, coordinate formatting, curriculum learning), which I’ll cover below. Here is the resulting model at work, running on an A10 GPU, driven by my local machine’s mouse, in near real-time:

The inference script is currently pretty slow, so here is a sped-up version of one session I recorded to drive the point home:

Think about it: a model taking in raw human input and rendering structured, semantically stable, and useful visual feedback. No lines of frontend code written, no rendering engine, and failure is soft (e.g. the mouse rendered with a slight offset).
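To make that concrete, the loop behind those recordings is roughly the following. This is only a sketch, assuming a diffusers ControlNet pipeline with the clean desktop screenshot as the conditioning image; the model paths, step count, and display call are placeholders:

    # Rough sketch of the realtime loop: read the cursor, format the prompt,
    # and run a ControlNet-conditioned Stable Diffusion pass with the clean
    # desktop screenshot as the conditioning image.
    import pyautogui
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "path/to/trained-controlnet", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    desktop = Image.open("desktop_screenshot.png").convert("RGB").resize((512, 512))
    screen_w, screen_h = pyautogui.size()

    while True:
        x, y = pyautogui.position()
        col = min(int(x / screen_w * 20), 19)
        row = min(int(y / screen_h * 20), 19)
        frame = pipe(
            f"mouse at column {col}, row {row}",
            image=desktop,
            num_inference_steps=20,
        ).images[0]
        frame.show()  # placeholder: a real loop would stream frames to a window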

Details

I’ll briefly go over the details of how I got this to work for anyone wanting to try it out (it’s really easy). In essence, this experiment is just a ControlNet trained on all 20x20 mouse positions on my desktop.
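If you want to reproduce it, the training set can be built in a single pass over the grid: paste a cursor sprite onto the clean screenshot at each of the 20x20 = 400 cells, pair each result with its coordinate prompt and the clean screenshot as the conditioning image, and feed the triples to a standard ControlNet training script such as diffusers’ train_controlnet.py. Here is a sketch of the data generation; the file names, cursor sprite, and metadata layout are assumptions for illustration:

    # Sketch: generate (prompt, conditioning image, target) triples for every
    # cell of the 20x20 grid.
    import json
    from PIL import Image

    GRID = 20
    desktop = Image.open("desktop_screenshot.png").convert("RGB")
    cursor = Image.open("cursor.png").convert("RGBA")  # assumed cursor sprite
    w, h = desktop.size

    records = []
    for row in range(GRID):
        for col in range(GRID):
            target = desktop.copy()
            # paste the cursor with its tip at the centre of cell (col, row)
            cx = int((col + 0.5) * w / GRID)
            cy = int((row + 0.5) * h / GRID)
            target.paste(cursor, (cx, cy), cursor)
            name = f"target_{col:02d}_{row:02d}.png"
            target.save(name)
            records.append({
                "text": f"mouse at column {col}, row {row}",
                "conditioning_image": "desktop_screenshot.png",
                "image": name,
            })

    with open("metadata.jsonl", "w") as f:
        f.write("\n".join(json.dumps(r) for r in records) + "\n")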

Here are some pretty funny phases in the training process (the image prior is always the original screenshot of my desktop):

Valley
Canyon
Wes Anderson
Fir trees
Fall
Flock
Triple vision
Diffused

Looking forward

Of course, this toy example is pretty useless as is. The rendering is limited to the mouse, and even if it could render windows, it’s ultimately not connected to any filesystem or application. Also, the supervised training procedure makes it hard to scale to more complex tasks.

But this experiment is complementary to backend-GPT, which got some attention a few months ago. Instead of using CLIP as the embedder, a bigger LLM and a memory module could be fused with SD to serve as the backend, along with other modality intakes such as the mouse. With a better backend, here are a couple of ideas for more useful things that could be done in part 2:

To make these ideas work, a smarter self-supervised training procedure would need to be devised, which would probably be the hardest part to crack. Also, the speed of inference would need to be dramatically improved for it to be truly real-time.

If you’re interested in any of this or have ideas, please reach out! In any case, stay tuned for part 2!

Related

Here are papers and articles that I thought about while working on this project:
