Relaunching neural frames
A few weeks ago, after two intense months with what started as a side project, the usage of neural frames had dropped off quite a bit. I was also annoyed with how I had written the thing: I had ideas for how to improve the site, but the existing implementation put a lot of restrictions on what I could do. I was burning money every day, so I wrote a mail to all subscribers asking for forgiveness and temporarily took the platform offline.
It took me six weeks of non-stop work to rewrite basically the whole thing. And now it's out again (tada!). There are a couple of new features that I am really proud of, and a couple of features that I will be very proud of once the UX is better than it currently is.
Recap of the technology
neural frames is based on Stable Diffusion, a text-to-image neural network. Basically, you type in some text (the prompt) and the AI generates an image from it. It also offers the capability to combine an input prompt with an input image to generate a new image (image-to-image). neural frames combines these two capabilities: first, the user generates an image from a text prompt, and then the user can generate videos by iteratively creating new images out of the last image. Stories are then told not with a single prompt like "A beautiful landscape in nature is turning into an artistic expression of biological cells" but with a series of prompts:
- 0-3 seconds: The prompt could be "A high resolution photography of a beautiful landscape"
- 3-6 seconds: The prompt could be "biological cells, beautiful digital art"
It is a new kind of storytelling - a frame-based storytelling that does not (yet) allow for smooth movements of people, but it does allow you to produce really cool videos. Also, because tricks can be played between iterations - for instance, zooming into the image slightly before feeding it back to the AI - one can create convincing camera movements. I am convinced that whatever happens to "true" text-to-video, this frame-based style of video generation will keep its place in the realm of digital content creation.
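To make the frame-by-frame idea concrete, here is a minimal sketch of such a loop in Python using the Hugging Face diffusers library. The model name, strength, zoom factor and frame rate are illustrative assumptions, not the actual settings behind neural frames.

```python
# Minimal sketch of the frame-by-frame idea with Hugging Face diffusers.
# Model name, strength, zoom factor and frame rate are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
model = "runwayml/stable-diffusion-v1-5"
txt2img = StableDiffusionPipeline.from_pretrained(model).to(device)
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model).to(device)

# A prompt schedule: each entry covers a span of the video, here at 4 frames per second.
schedule = [
    (0, 3, "A high resolution photography of a beautiful landscape"),
    (3, 6, "biological cells, beautiful digital art"),
]
fps = 4

def zoom(image, factor=1.02):
    """Crop slightly towards the center and resize back, i.e. zoom in a little."""
    w, h = image.size
    cw, ch = int(w / factor), int(h / factor)
    left, top = (w - cw) // 2, (h - ch) // 2
    return image.crop((left, top, left + cw, top + ch)).resize((w, h))

frames = []
current = txt2img(schedule[0][2]).images[0]  # seed frame from text alone
frames.append(current)
for start, end, prompt in schedule:
    for _ in range((end - start) * fps):
        current = zoom(current)  # small camera movement between iterations
        current = img2img(prompt=prompt, image=current, strength=0.45).images[0]
        frames.append(current)
```

The strength parameter controls how far each new frame may drift from the previous one: low values keep the video coherent, high values let the current prompt take over faster.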
A new type of motion and interpolation algorithm
If you just put the output images of the AI one after another and make a video from them, the result is somewhat brutal. Initially, I launched neural frames like this, but quickly realised that the videos were unpleasant to look at. What you want instead is to interpolate between two AI outputs along the temporal dimension. If you do this with motion (such as zoom), you run into problems: one image is effectively a zoomed-in version of the other, and naively blending the two just produces blur.
neural frames 2.0 has a new algorithm to interpolate between images in time while taking the motion into account. It is greatly inspired by the awe-inspiring open-source tool Deforum. It's so cool how smooth the videos can become with this.
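The core principle, sketched below, is to move the keyframes along the zoom before blending, so that the two images are geometrically aligned for every in-between frame. This is a simplified illustration of that idea, not the exact algorithm running on neural frames or in Deforum; the zoom factor is an assumption.

```python
# Sketch of zoom-aware interpolation between two AI keyframes (frame_b was
# generated from a slightly zoomed-in version of frame_a). A plain cross-fade
# would blur, because the two images are misaligned by the zoom.
import numpy as np
from PIL import Image

def zoom_by(image, factor):
    """Zoom into the center of the image by the given factor (>1 zooms in)."""
    w, h = image.size
    cw, ch = int(w / factor), int(h / factor)
    left, top = (w - cw) // 2, (h - ch) // 2
    return image.crop((left, top, left + cw, top + ch)).resize((w, h))

def render_inbetween(frame_a, frame_b, zoom_step, t):
    """Render the frame at fraction t (0..1) of the way from frame_a to frame_b."""
    w, h = frame_a.size
    # 1. Move the camera: zoom frame_a forward by a fraction t of the full step.
    base = zoom_by(frame_a, zoom_step ** t)
    # 2. frame_b sees a slightly smaller field of view from this camera position,
    #    so shrink it and paste it centered on top of the zoomed frame_a.
    scale = zoom_step ** (1 - t)
    sw, sh = int(w / scale), int(h / scale)
    aligned_b = base.copy()
    aligned_b.paste(frame_b.resize((sw, sh)), ((w - sw) // 2, (h - sh) // 2))
    # 3. Cross-fade between the two now-aligned images.
    blended = (1 - t) * np.asarray(base, np.float32) + t * np.asarray(aligned_b, np.float32)
    return Image.fromarray(blended.astype(np.uint8))

# Example: four in-between frames per keyframe pair at 2% zoom per step.
# inbetweens = [render_inbetween(a, b, 1.02, (i + 1) / 5) for i in range(4)]
```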
I am planning to add more dimensions (other than zoom) soon, if there's some interest in it.
Prompt transitions
There was another thing that bugged me about the old platform: when going from one prompt to the next, there was no transition - it cut instantly from one prompt to the other. And this was true not only for the prompts but also for the parameters. You would set a zoom of 1% per frame for one prompt and -1% per frame for the next, and there was a sudden, unpleasant jump in zoom velocity that kind of made me nauseous.
So I have built a new kind of video editor. A true video editor. You add sections that represent prompts and their parameters, and if there is space left between two sections, the values are interpolated across the gap. I will spare you the technological details, but it basically gives you full control over the transitions, and it is actually quite fun to play with.
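As a toy illustration of the parameter side of this, here is how a per-frame value like zoom could be held constant inside a section and blended linearly across the gap between two sections. The Section layout and the numbers are made up for the example; prompt fading works in a similar spirit, but between prompt embeddings rather than plain numbers.

```python
# Toy sketch: per-frame parameters are constant inside a section and
# linearly interpolated in the gap between two sections.
from dataclasses import dataclass

@dataclass
class Section:
    start_frame: int
    end_frame: int
    prompt: str
    zoom_per_frame: float  # e.g. 1.01 means +1% zoom per frame

def zoom_at(frame, sections):
    """Return the zoom for a frame: constant inside a section, blended in gaps."""
    for i, sec in enumerate(sections):
        if sec.start_frame <= frame < sec.end_frame:
            return sec.zoom_per_frame
        nxt = sections[i + 1] if i + 1 < len(sections) else None
        if nxt and sec.end_frame <= frame < nxt.start_frame:
            # inside the gap: interpolate from this section's value to the next one's
            t = (frame - sec.end_frame) / (nxt.start_frame - sec.end_frame)
            return (1 - t) * sec.zoom_per_frame + t * nxt.zoom_per_frame
    return 1.0  # default: no zoom outside any section

sections = [
    Section(0, 40, "A high resolution photography of a beautiful landscape", 1.01),
    Section(60, 100, "biological cells, beautiful digital art", 0.99),
]
print([round(zoom_at(f, sections), 3) for f in (30, 45, 50, 55, 70)])
```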
Model fine tuning
This kind of goes under the category of "UX still needs to improve", but it is a super powerful feature that I just didn't want to re-launch without. Basically, by uploading 10-20 images of a person or an object, it is possible to teach the image-generation AI that object/person (as already mentioned in a previous blog post). Making videos with this is super powerful and so far not widespread. For instance, may I introduce you to a video of myself as a laughing hippie (the transition to the laughing only looks so decent because of the prompt fade). The second video is made with a model fine-tuned on a coffee mug.
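If you want to try something similar yourself: this kind of personalization is typically done with a DreamBooth-style fine-tune of Stable Diffusion. The snippet below is only a sketch of the inference side; the checkpoint path and the "sks" placeholder token are hypothetical.

```python
# Sketch: generating a seed frame from a fine-tuned (DreamBooth-style) checkpoint
# saved in the diffusers format. Path and "sks" token are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./finetuned-model",          # hypothetical output directory of the fine-tuning run
    torch_dtype=torch.float16,
).to("cuda")

# The rare token bound to the subject during fine-tuning stands in for the person/object.
image = pipe("a photo of sks person as a laughing hippie, film grain, 1970s").images[0]
image.save("seed_frame.png")
```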
Pimp my prompt
It is really quite incredible what kinds of animations can be created from nothing more than text inputs. Whereas in the pre-text2video era a digital artist needed a lot of technical expertise to create stunning visuals, she now needs little more than the right words. In return, this means that prompting is growing strongly in significance - and there are heaps of job openings for the position of "prompt engineer" already.
For neural frames, too, I saw that the prompting part is actually the hardest - well, what DO you want to see a video of? So I added a button called "pimp my prompt" that sends whatever the user typed as the prompt to a large language model I prepared, in order to make the text input more "Stable Diffusion-esque".
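Under the hood, this is little more than a system instruction wrapped around the user's input. The sketch below uses the OpenAI API as a stand-in; the model name and the exact instruction are assumptions, not the prompt neural frames actually uses.

```python
# Sketch of the "pimp my prompt" idea: hand the user's text to an LLM with a
# system instruction that rewrites it in a more Stable Diffusion-friendly style.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def pimp_my_prompt(user_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's idea as a Stable Diffusion prompt: add style, "
                "lighting and quality keywords, keep it under 60 words, return only the prompt."
            )},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(pimp_my_prompt("a beautiful landscape"))
```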
Wrapping up
Well, that's it for now. I am working hard to make things more robust and hope that you will be part of that journey! In the end, it's a trippy visualizer - and probably the most cutting-edge there is. It's also a great way to create AI music videos. An AI music video generator can look like this.
Once a week I am writing about life as an indiepreneur in the space of AI. If you are interested in being part of that journey, I'd be honored if you subscribed to this blog by clicking on the purple button at the bottom right. Thanks! ❤️