The Rise of Generative Media: fal's Bet on Video, Infrastructure, and Speed
fal is building the infrastructure layer for the generative media boom. In this episode, founders Gorkem Yurtseven, Burkay Gur, and Head of Engineering Batuhan Taskaya explain why video models present a completely different optimization problem than LLMs, one that is compute-bound, architecturally volatile, and changing every 30 days. They discuss how fal's tracing compiler, custom kernels, and globally distributed GPU fleet enable them to run more than 600 image and video models simultaneously, often faster than the labs that trained them. The team also shares what they’re seeing from the demand side: AI-native studios, personalized education, programmatic advertising, and early engagement from Hollywood. They argue that generative video is following a trajectory similar to early CGI: initial skepticism giving way to a new medium with its own workflows, aesthetics, and economic models.
Hosted by Sonya Huang, Sequoia Capital
- Published: Dec 10, 2025
- Uploaded: Jun 11, 2026
- File type: POD
Full transcript
AI-generated transcript with timestamped sections.
[00:00] We recently had our [00:01] first generative media conference, and Jeffrey Katzenberg, former CEO of DreamWorks, was there and he made a comparison. He said, this is exactly playing out how animation did: when it first came out, people revolted against it. It was all hand-drawn before that, and computer graphics was new, and there was a lot of rebellion against computer-driven animation and something [00:31] of stopping technology. It's just gonna happen. You're either gonna be part of it or not. [00:36] *music* [00:53] In this episode, we sit down with the team from fal, the developer platform and infrastructure powering generative video at scale. fal is a place that developers can go to access more than 600 generative media models simultaneously, [01:05] from OpenAI Sora and Google Veo to open-weight models like Kling. [01:09] We'll discuss why video models present fundamentally different optimization challenges than LLMs, why the open source ecosystem for video has a thriving long tail in ways that text models never did, and why the top video models have a half-life of just 30 days. [01:24] The team also shares insights from the demand side of the video model equation. We discuss what's happening in the app layer, from AI-native studios to personalized education, what's happening in Hollywood, and more.
[01:34] Enjoy the show. [01:35] Thank you so much for joining us today. I want to start with the problem space that you decided to tackle. So fal is a developer API and platform for generative video and image models. Video is massive, obviously; it is more than 80% of the internet's bandwidth, and it follows that generative video is going to be similarly massive. But there are not that many companies focused on this problem. Why do you think that is? [02:01] Yeah, in a way, generative image and then video was an overlooked market in this current phase of AI. In my opinion, for two reasons. Number one, there wasn't... [02:15] a very clear... [02:16] industry use case that people were going after. There wasn't vibe coding that [02:22] automates software engineering, or there wasn't search, which it seems the LLM market is going after, or customer support, anything like that. Also, number two, the investment on the research side wasn't as big three years ago, and then that ramped up. [02:42] A little bit slower than LLMs, but still considerably since then. And now the models are much more capable, much more useful, with real industry use cases. Compared to that, three years ago it felt like a toy use case. This was just going to be for fun on the side, and it was going to be a small market in the end. And now we can see that it's going to be a massive market with very unique use cases and customers compared to the LLM market.
[03:12] If you actually go back to, like, how we were experiencing it, I think that was an interesting time. [03:19] We were working on some, like, Python compute infrastructure. And then these models like DALL-E 2 had just come out. And then soon after that, ChatGPT had come out. And then Llama had come out. And initially, we didn't know that, you know, [03:38] the image and video market was going to get that big. We were actually just curious about, like, running image models much faster. That was, like, our initial entry point. And then we saw the initial growth. We had a few customers and they were growing really fast. We were like, "What the heck is going on?" And then, you know, a few customers later, we actually thought, "Hey, we should double down here." And around that time, the other thing that was happening was [04:06] people were over-indexed on language models. This, like, story of AGI was being told and, you know, that attracted all the dollars, that attracted all the talent. So everyone was just working on that, whereas we thought, like, we had something niche, growing fast, you know, don't tell anyone. And then we just, like, [04:26] started focusing on that, and soon after, as we got more familiar with the models, we thought the... [04:32] I remember, I think we changed our website copy to say generative media, generative media platform. And then it was only, like, two or three months after that that Sora was announced. So we were definitely ahead. But we really saw the whole future kind of coming, with better image models, video models, etc. So yeah, we made this early bet. I mean, you guys have a front row seat to the sorts of new experiences people are building. I think the market's only going to expand from the media market that we know today.
[05:02] Yeah, absolutely. I think, like, an Andrej Karpathy tweet, you know, no good podcast without it. He did one recently where he was talking about why he's excited about the, you know, media models. And one of the things he said, he also mentioned that, right, you know, people are visual, and we have so much more video than text, like a wall of text. And he was saying, like, um, [05:27] He was making a point around education and a lot of the content you consume just to, like, learn things. I think right now the model quality is just, like, [05:37] Like, relatively, it's just so much... [05:40] like [05:41] It's so much worse than what it can be, where you could actually have, you know, I do a lot of learning on ChatGPT, but it's through text. But if it actually rendered a video where it could compress a concept, right, instead of, you know, [05:55] 10,000 characters, if it could do it in, like, 15 seconds, it'd be so much better. I think there's sort of, like, [06:03] the quality bar where if [06:06] like, it's going to go up. And once we have that, we're going to have even more penetration. So it's really a function of [06:14] the quality right now, and we're just at the very beginning. Totally. The education market is almost untouched right now with video generation, and there's so much potential there, and it's just waiting for [06:25] the quality, the predictability to get there. And I think it's going to have a lot of potential. Totally. I mean, you guys sent me that generative video Bible app. I think it's a much better way to learn some of the lessons from the Bible. If you're capturing consumers' attention right where they are,
[06:42] Exactly. I agree with you. We're just at the beginning. So fal is an infrastructure company, and I love infrastructure companies. So we're going to structure today's interview in terms of the technical layer cake. We're going to start from the core inference engine, the compilers and kernels that you've built. We're going to go up to the model layer and then the workflows, and then end with some observations on the markets and what people are building. [07:05] Let's do it. Sounds exciting. Okay, let's do it. The inference engine. Batuhan, how old are you? 22. You're 22 years old. Okay. Say a bit about your background. I think it's super badass and makes complete sense why this company is so hardcore. I started working on compilers when I was 14. So in a way, I have a lot of experience on that front. It's not just that. But I started working on open source projects. [07:35] And then I started to slowly contribute back to the Python language, the core compiler, the core parser, and the core interpreter itself, and became one of the core maintainers of it. I think at the time I was the youngest core maintainer of the language. And this kind of gave me a unique appreciation of compilers and how flexible they are. So when we first started working on serving these image models at fal, [07:58] the main idea was, okay, there are these [08:02] three different image models, three different architectures, but this is surely going to explode. There are upscalers. There are going to be video models. We were predicting that, and we didn't want to go optimize a single model, put our
[08:14] eggs into a single basket and then become invalidated when the next model comes. So we started building this inference engine, which is a tracing compiler that traces execution and essentially tries to find common patterns that fit within the templated kernels that we write. So our bread and butter is, like, we have a 10-person performance team that's putting all their effort into writing kernels that are, like, 95% there, but generalized with templates. [08:44] like replace these templated semi-generic kernels with specialized kernels at runtime and optimize the performance of these models. And we found this technique to yield results superior to pretty much anything that's out there in the market. And this led us to claim, like, the number one spot on performance on all the benchmarks. And another big thing about this is we specialize in doing, like, this sort of kernel-level, mathematically correct, sound abstractions that let us like, [09:14] media industry and when you really care about the output that you're getting. [09:18] What's different between optimizing a diffusion model versus an autoregressive LLM? [09:23] In autoregressive LLMs, your bottleneck is how fast you can move [09:27] all those giant weights from memory to SRAM, because you have, like, a 600 billion parameter model, and you're trying to predict the next token. You're doing the attention for all the tokens, like a couple tokens before that. In diffusion models, you're trying to denoise [09:41] thousands, tens of thousands of tokens for a video at the same time, doing attention over it. So you're essentially saturating all the compute bandwidth of these GPUs. You're not necessarily bound on memory bandwidth, but, like, the computational operations that you do are fully saturated. So you're trying to find better ways to execute on the GPU. This could be, like, writing more efficient kernels, or this could be overlapping, you know, softmax with the GEMMs that you do.
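The memory-bound versus compute-bound distinction described here can be made concrete with a rough roofline-style estimate. The sketch below is illustrative only: the accelerator figures (`peak_flops`, `mem_bw`) are round hypothetical numbers rather than any specific GPU's specification, the function names are invented for this example, and the model counts only weight traffic for a dense layer, ignoring activations and attention.

```python
def matmul_intensity(tokens: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weight traffic for a dense layer applied to `tokens` tokens.

    Each weight is read from memory once and used in one multiply-add
    (2 FLOPs) per token, so intensity grows linearly with the token count.
    """
    return 2 * tokens / bytes_per_weight


def bound(tokens: int, peak_flops: float = 1.0e15, mem_bw: float = 3.0e12) -> str:
    """Classify a workload against a hypothetical accelerator's ridge point."""
    ridge = peak_flops / mem_bw  # FLOPs per byte where compute and memory balance
    return "compute-bound" if matmul_intensity(tokens) >= ridge else "memory-bound"


print(bound(1))        # one new token per decode step, batch size 1
print(bound(100_000))  # one denoising step over ~100k video tokens
```

Under these assumptions a single-token decode step sits far below the ridge point, while a denoising pass over tens of thousands of tokens sits far above it. Raising the token count passed to `bound` also mimics a larger LLM batch, which is why large-batch LLM serving drifts toward the compute-bound side as well.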
Like, it's essentially like you're trying to use all of the power of the GPU, leverage it in a way that
[10:11] capabilities. So it's a different binding constraint. It's on the compute versus the memory. And what's the intuition for why LLMs are [10:18] relatively memory constrained and why video models are by comparison relatively compute constrained, but not as large in terms of just sheer number of parameters? I think it's a scaling issue, right? Like, if you scale these video models to 600 billion parameters with the same dense architecture, you're gonna have to do attention with all those full, like, let's say a single video has a hundred thousand tokens, and you do this attention step, or like you do this denoising step 50 times, and every one of those 50 times you do attention over all these, like, hundred [10:48] I think the constraint there is just how fast you can do the inference. And the same applies to LLMs at, like, larger batch sizes. But at the traffic patterns that people have, you know, the batch sizes are not that large. And you're mainly constrained by memory bandwidth. So people do optimizations like speculative decoding and other things to reduce that overload. Yeah. What exactly goes into being at the top of the leaderboard in terms of performance? Because I would imagine there's other teams that are also very smart people. And, you know, this is my Olympics. [11:18] People have, like, very similar ideas on the techniques and different optimizations they can do. I don't think anyone cares about it as much as us. We are literally obsessed with generative media. We are literally obsessed with these models. We have a team that's just focusing on this. So far, it seems like, from NVIDIA to other inference players, everyone is super obsessed with language models. Everyone's trying to get, like, one more token per second on, like, DeepSeek benchmarks, whatever. And we're, like, in a different lane. We
[11:48] we found out, like, the best way to optimize these generative models. And we just focus on this. This is, like, a purely focused thing, right? At the end of the day, you're constrained by the hardware. There's nothing unique about it. But we're just, like, three months ahead, six months ahead. Like, when we benchmark Torch, the latest version of Torch, against, you know, our inference engine from a year ago, [12:07] We're clearly underperforming because Torch caught up. The same thing is going to happen with other players. You're going to always... [12:13] The lead that you can maintain is three months, six months ahead at most. The thing that matters is just focus. If you focus on it, if you purely put all your energy into it, I think it's very hard to... [12:24] get outcompeted by others. Because models are slightly changing each month, each release. So it's still the same general architecture, but there are slight differences where we can go in and optimize what's different, and no one else is paying that much attention to it. Also, hardware is changing as well. So we were able to adapt to B200 earlier than anyone else. And we were able to run video models much faster basically throughout the year because of that obsession with [12:54] hardware. Yeah, got it. What are the hardest technical problems that you think you're solving? So one thing people don't appreciate as much is we are running 600 different models at the same time. We have to be running them [13:07] we have to be so good at running them that we should be running a single one of them better than someone else who is running only that single model. Because when a foundational lab is running models, maybe they have a single version of the model, maybe they have like,
[13:21] a couple of other different versions, and that's all they care about. We have to be better than them at running those models, and we have to be doing [13:30] all 600 at the same time. So on top of the inference optimizations that happen on the GPU, a lot of optimizations [13:37] on the infrastructure level need to happen. We need to manage the GPU cluster in a way that's efficient, to load and unload these models at the right times. We need to route traffic to the right GPUs that have the warm cache of these models. We need to be smart about choosing the right kind of machines, which kind of chips are running what kind of models. And the customer traffic is changing all the time, and we need to adapt to that. So on top of the inference engine, [14:07] really, really hard beast to manage. And so far we've done an incredible job at that. Would you add anything to the... I think that's a pretty fair explanation of what we do. Like, I call this distributed supercomputing. I don't know why people don't like that name, but like... I like it. I like it. But the idea is we are at, like, 28, 30, this was like a month ago. Now we are probably at 35 different data centers. And you have these, like, [14:34] heterogeneous groups of compute that are split across them, with their own, you know, different specs, different networking, whatever, and you're trying to schedule workloads as if it's a homogeneous cluster that you got from a hyperscaler. It doesn't work like that. So we spent the last three years building that abstraction over it, from our own orchestrator to building our own CDN. Like, we go back to, you know, the fundamentals of the web, and we built our own CDN service, deploying racks to colos, you know, just routing traffic. So we built all these technologies to essentially
[15:04] or whatever it does, as schedule our workloads. [15:06] which is, like, very different than a traditional enterprise LLM usage pattern. You know, the use cases that we have are so much more spread across, so much more, you know, consumer facing. And when you consider that, there is so much investment going into making sure we can tap into this scarce capacity of GPUs. Yeah. You mentioned hyperscalers. And, you know, I hear distributed compute and I hear managing giant clusters. And I naturally think that's somewhere where hyperscalers should have the incumbent advantage. [15:36] able to so far execute them on the core engine. [15:40] There's two things about the kernel, right? There's the inference part, where none of the hyperscalers have any expertise. This is a net new field. [15:48] I think this has been only happening for the past three years, inference optimization. So it's like a brand new lane where we have been outcompeting anyone in our field. I think that's pretty much an answer of its own. And the second one is infrastructure. I think right now hyperscalers are... [16:02] very busy with their traditional [16:05] pattern of, oh, we have this data center capacity, we'll just deploy GPUs and we don't care about trust. This has been changing recently. You know, even Microsoft is going and buying from neoclouds. There's an interesting pattern happening because the GPUs, and the demand and the growth of GPUs, don't fit the patterns of these hyperscalers, the growth patterns they expect. So I think at this stage, not even hyperscalers have that big of an advantage of scale, because they're going and buying GPUs from neoclouds. Like, the tables have turned a bit. [16:33] Yeah, it almost helps to also be slightly earlier in the company journey, right? Like, if you're a public company, you also have to kind of abide by, like, what the...
[16:45] what the market's expecting of you. So the other thing is that there's a huge price discrepancy between hyperscalers and neoclouds, right? So it's maybe sometimes 2x, 3x more expensive to use things, you know, through hyperscalers. What's driving that? [17:05] Well... [17:05] I think one is market pressure, right? And also there are added operational expenses that hyperscalers have for having, you know, better... [17:17] They just have a better service, right? Better uptime and better SLAs, and all of these things add up. And then on top of that, there's kind of an established cloud margin, right? And, you know, the market expects the cloud margin to be at a certain level. Whereas if you have a three-year-old neocloud, you know, you're a private company, [17:37] Maybe you don't have as much pressure. And assuming infinite demand and limited capacity, you know, hyperscalers can keep their prices high, and they will fill out the capacity and also get slightly better economics. Whereas neoclouds compete over that whole infinite demand, and that pushes the prices down. [17:59] Perfect price competition. What does it take to run image versus video models? Well, you guys started the company around the Stable Diffusion moment. The field was mostly image at the time. How does running video models compare to image?
[18:29] let's say it takes 1x of teraflops. I think it's tens of teraflops, but let's call that unit one. One image is around 100x of that. And if you are doing a five-second video at 24 FPS, that is around 120 frames. So 100x from one image. So you are already at 100x of the image. [18:57] of 100x, so you're at 10,000x for a low, standard-definition video. And if you want to do 4K, that's another 10x. So 100,000x compared to a single 200-token LLM input. So it is a lot more compute intensive in terms of the amount of flops you are doing. [19:22] Yeah. In general, when we started with image, the infrastructure was relatively easier to do, because it takes three seconds, or it took 15 seconds back in the day, to generate an image. You don't necessarily need to shave off that 50 ms, 100 ms you have overall in the system. And then when we went to video, it's even easier, because it takes, like, 20 seconds, 30 seconds to generate a video. [19:52] link from these GPUs [19:54] That's where we actually spend some of our time. We started this process with speech-to-speech models a year ago. We started optimizing them, where we were able to reduce the latency of our system with a globally distributed GPU fleet. When you send a request, we route to the closest GPU, minimize our own overhead, and then we pick the best runner, stuff like that. So we are now applying those same optimizations to real-time video. And we actually see really good, interesting demand there, where people want to experience this stuff,
[20:24] as they type, as they prompt. And that's where some of the infrastructure technical challenges differ from traditionally running image and video models. Because image and video are... [20:32] similar-ish, you know, just more compute expensive. But you actually need to care about infrastructure stuff when you go to, like, [20:39] less than a second generation time for some of these [20:42] Yeah. Another interesting thing is, image models especially, you were able to run them on a single GPU. The parameter count is actually much smaller. So that actually makes it a little bit easier for us, as opposed to LLMs. And then with video, the parameter count is going up. Right now, I think we're around, for the open source ones, [21:05] I don't know, 30 billion parameters. Whereas, you know, we... [21:09] We hear rumors about GPT-4 being in the trillions, GPT-5 [21:14] maybe more. So that's another... That's like... [21:18] you know, on the flip side, it's a little bit easier, but [21:22] it doesn't mean video models are not going to grow, right? There's, you know, rumors around numbers for Veo, numbers around Sora. So there's also an increase in parameter count, so you're going to have to kind of [21:37] use more distributed computing. But if you're just eight nodes, or one node or eight nodes, you kind of have a slight advantage. Yeah, totally. Okay, let's pop up one layer of the stack to the models. Let's do it. [21:50] So one thing I think people don't fully appreciate about the media space, and you alluded to this before, is that there's a very, very long tail of models that are actually used in practice. And so I was hoping you could give people a sense of, on your platform, how many models are people actively using? How is it distributed? And why do you think there's such a long tail of models being used compared to the LLM space?
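The back-of-envelope numbers in the preceding answers (relative compute per request, and whether a model's weights fit on one GPU) can be worked through in a few lines. This is a sketch using the speakers' round multipliers; the 2 bytes per parameter (16-bit weights) and the 80 GB of memory per GPU are assumptions added for illustration, and `gpus_for_weights` is a hypothetical helper, not part of any platform API.

```python
import math

# Relative compute, with one 200-token LLM request as the unit.
LLM_REQUEST = 1
IMAGE = 100 * LLM_REQUEST   # "one image is around 100x of that"
FRAMES = 5 * 24             # five seconds at 24 FPS
SD_VIDEO = 100 * IMAGE      # the speaker rounds 120 frames to ~100x an image
UHD_VIDEO = 10 * SD_VIDEO   # "4K, that's another 10x"


def gpus_for_weights(params: float, bytes_per_param: int = 2,
                     gpu_mem_gb: float = 80.0) -> int:
    """Minimum GPUs needed just to hold the weights.

    Ignores activations, caches, and any parallelism overhead, so real
    deployments need more headroom than this lower bound suggests.
    """
    weight_gb = params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_mem_gb)


print(FRAMES, SD_VIDEO, UHD_VIDEO)
print(gpus_for_weights(30e9))  # ~30B-parameter open video model
print(gpus_for_weights(1e12))  # a rumored trillion-parameter LLM
```

Multiplying the stated factors through gives roughly 10,000x for standard-definition video and 100,000x for 4K. On the memory side, a 30-billion-parameter model at 16 bits is about 60 GB of weights, which is why it can still sit on a single large GPU, while a trillion-parameter model needs its weights sharded across many devices.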
[22:10] This is actually one of the things I would say people got wrong three years ago. I mean, the jury is still out, but people... [22:18] right after ChatGPT, people started talking about omni models. There are going to be these giant models, and they're going to be able to generate video, audio, image, and code, text, every type of token. This might still happen, I think, but it's more clear that... [22:37] you are better off if you optimize for a certain type of output. Even this is true for code generation, and definitely true for image or video output. So when we were pitching three years ago, that's one piece of feedback we got from everyone: oh, there's going to be omni models and there's going to be a single way of running these. It's going to be hard to create [23:01] an edge on the modality. But it turns out that's not true, and it actually makes sense to have a technical edge on the modality. And this is one of the reasons why there's also a variety of models, because still the best upscaling model is just doing upscaling, and the best image editing model, even the best text-to-image model, is different from the image editing model. So all these special tasks require... [23:29] their own model. It might be a similar model family or a similar architecture, but at the end of the day, it has its own weights that need to be deployed independently, and that creates the variety in the ecosystem. I think this also applies to language models where
[23:47] Even in the same modality, there are different families of models with different tastes, different characteristics. There are different personas. And this happens with language models still. The code that Claude writes is very different than the code GPT-5 does. And we see this happening, but the good thing here is there are these three, four different... [24:06] personas on top of, like, different categories: upscaling, editing, video, text-to-video, whatever, stuff like that. So it gets you, you know, close to 50 models that are active at any point in time, and then you have a very long tail of models that people still choose because they might like the persona of that one better. Yeah, totally. Um, [24:25] Speaking of model personalities, what are some of the most popular models on your platform? What do you think are the personalities of them? [24:32] So one thing that's been true since the beginning: the popular models change all the time. So there's always... [24:39] new releases from different labs that take over from the others, and it's always a moving target. But that being said, there are two types of models usually preferred by our customers. It's [24:51] Usually there's one big expensive model that has the best quality on video generation. This could be Veo, this could be Kling, this could be Sora. And then there's usually a workhorse model, which is cheaper, smaller, but good enough. And people usually use that at higher volumes. I would say this has been true for almost the past two years, that there's an expensive high-quality model [25:17] that keeps changing. There's a cheaper...
[25:21] good enough model that keeps changing. But overall, this has been constant. And is the workhorse model for prototyping, and then you run it through the big expensive model for the final product? Or what do people use the workhorse for? It's for higher volume use cases. And depending on the application you are building, you might encourage different, like, lots of variations of the same output maybe. But it's very application specific, I would say. Yeah. There's also another dimension, I think, that's, like, [25:51] kind of happening in real time right now, which is based on the different use case you want to use the model for. So, like, one, when... [25:59] OpenAI [26:01] released GPT image editing, that model had just, like, [26:05] superior text generation and editing capabilities. And for things that [26:11] require a lot of text, people started going [26:14] and choosing that model versus the other models. So it also tends to correlate with [26:21] the different capabilities models are bringing, and also what they're good at, right? So Kling, for example, people really like it for visual effects types of workflows, because, you know, they had that kind of data in their data set, as opposed to, you know, some other models. For example, Seedance is very good at detailed textures and artistic diversity, things like that.
[26:51] It's really a matter of also this sort of use case dimension that models excel at. [26:59] An interesting metric that we saw in Q2 and Q3 was that the half-life of a [27:04] top five model was 30 days. [27:06] Wow. It's very, very interesting to me where, you know, these models are continuously shifting, like the top five of the models are continuously shifting. [27:15] Tough depreciation schedule for the model providers. Hopefully they are building on top of the work that they've already done. So it's, you know, additive in the end. But yeah. [27:26] Yeah, I'm teasing. And the model probably is in a more turbulent state right now than what the end state will probably be. [27:34] What do you guys think is the most underrated model? Like, what are your personal favorites? [27:38] I usually like Kling models for video. [27:43] But... [27:44] This kind of has been changing because they don't have sound. For sound, we have Veo 3 and Sora. They are the only ones. A lot of people are working on it, so we'd love to have more variety there as well. For image models, I like Reve's model. Flux still holds a very nostalgic value for me, even though it's been a year. I still go back to Flux. There are variations of Flux models now that I like. [28:11] I'll go with Midjourney, which is not on fal. It's not available via API. I just like the... [28:18] how they navigated the space, I think, is very interesting. Like, they kind of brought this photorealism, which was, um,
[28:27] That was a very big deal at the time. No model could do it. And then now they're more like this [28:33] artsy model. [28:34] Right. Like, now photorealism is kind of cracked and no one cares about it. So now they have this niche, very artistic kind of visuals, which is very cool. Yeah. I'd love to chat about the marketplace dynamics a little bit. So I understand your business as a little bit of a marketplace, where you aggregate developers on one side of the market. That's the demand side. And you aggregate model vendors on the other side of the market. That's the supply side. And the model vendors are both [28:59] proprietary APIs, model labs that view you as a distribution partner, [29:05] and then also open models that you host and run yourselves. And so maybe talk a little bit about the closed model providers. You have partnerships with OpenAI on Sora, with DeepMind on Veo. What's in it for them? Why do they choose to partner with you? [29:22] We were one of the first platforms that accumulated... [29:27] developer love, and following from that, these developers work at big companies, so they started working with us, and we really built the platform for simplicity and being able to [29:42] get going really fast. And because of the thing Batuhan mentioned, that the half-life of these models is really short, people usually work with many different models at the same time. So we were able to claim that we have this big developer base that loves the platform and is not tied to any single model and is here for the platform. And model research labs see this
[30:12] distribution channel and tap into the developer ecosystem that we built. On the other side, this helps us with the [30:21] next model provider, because they see all the developers and they want to be on the platform as well, which attracts more developers to the platform and creates a very nice positive flywheel for us. Yeah, it very much is a marketplace business. And for developers, it's a single choke point to be able to access multiple model vendors. And to your point that the model space is changing so quickly, I think they really do value that choice. Yeah, we call it marketplace [30:51] developers so there's additional benefits to [30:54] which ties into the flywheel effect that we are creating. So it's marketplace plus other services next to it. How do you position yourselves to get, you know, in some cases, day zero launch access, sometimes exclusive launch access, to models like Kling and Minimax? How have you done that? [31:12] Throughout the last two years, we were able to build a very robust marketing machine as well. And this is our connection point with the developers who are on the platform. Every time we release something, this creates another opportunity for us to build. [31:27] introduce a new capability, introduce a new model, and model developers also see that. And we usually do co-marketing together. And as part of that co-marketing, we get exclusive release access for a certain period of time, sometimes forever.
[31:44] We have a couple of competitors that are [31:48] on the smaller side, so model developers want to work with the biggest platform out there, and increasingly that platform is ours, and we get to have these exclusive benefits with the model providers. That's awesome. Why do you think it is that the open source model ecosystem has been so vibrant [32:06] for video models? You know, it almost feels like the text models are just consistently a generation behind, whereas in video, you know, there's so much that's happening in the open source realm. Video and also image editing as well. Why do you think that is? [32:21] It started with Stability. Their first open source Stable Diffusion got insane adoption. [32:28] Almost the same team then started Black Forest Labs, and they knew the power of open source, how it helps them create the ecosystem. And with image and media models, the ecosystem actually matters. When developers are training LoRAs, when they are building adapters, when they are building on top of your model, it really matters. [32:46] It brings free marketing, but also creates stickiness. So there are still people who are using Stable Diffusion models because they like that ecosystem, because it was so open. So the Flux team saw this [33:01] from their experience at Stability, and they had a very smart strategy of having at least some [33:06] models that are open source, some that are closed source. And a lot of video model providers that came after are following the same playbook, because you can have a very robust ecosystem. It gives you a lot of advantages in terms of marketing, in terms of developer love. And I think it's going to keep going like this. Yeah, totally. I want to add on to that. The domain is also very interesting. I think in the visual domain, the ecosystem actually matters more.
[33:36] When Llama 2 first came out, there were like [33:39] many fine-tunes out there, but if you actually downloaded one and started using it, [33:44] like, you can't, [33:45] you can't tell the difference. You can't really... If you're using, [33:51] I don't know, like a ControlNet, the concept doesn't even exist. You know, language models are a lot more general, like generalized. So you can't really [34:02] understand [34:03] the difference if you were to actually fine-tune it, right? So it kind of just ends up being very monolithic, as opposed to the visual realm, where any small adjustment you make to the model, [34:19] you know, [34:20] can actually have huge implications, right? And so [34:24] it's just [34:27] very fertile ground for a lot of customization. Yeah. I mean, speaking of Midjourney, David Holz, one of his quotes that I like is, you know, he's curating the aesthetic space with Midjourney. Yeah. And I very much think you just have this combinatorial explosion of styles aesthetically, and [34:46] I think that's the reason why some of the models on your platform are fine-tunes of other models, right? Yes, yes. And the thing is, even if you add a lot of diversity of aesthetics, [34:58] if you actually train on everything, if you have trained on too many, you may not be able to get the exact... there are so many times you want the exact aesthetics, and you may still have to fine-tune the model to get exactly the output you want. Whereas with LLMs, that's not really how you operate. You don't exactly want a particular outcome. It's a different problem. So this is a lot more
[35:28] You kind of have to do these post-training things on top of the models. Sora is another good example. Like, Sora 2 is very fine-tuned on social-looking stuff, right? And so, you know, you can have tens of different styles and you still probably want to [35:48] push the model towards that direction with post-training. [35:51] Yeah, absolutely. It all depends on the use case too. Like, a customer support chatbot [35:57] does not need personality. You want it to be as vanilla as possible. But we are talking about filmmakers, marketing teams. They all want to add the personality of their style or their brand. So they want to have [36:12] greater control over the outputs, whereas maybe in LLMs, that's not necessarily true all the time. If you have an agent, if you are in code generation, there's no equivalent of style and personality. Yeah, okay, that's a good segue for us to go one more layer up the stack. Let's go to workflows. [36:30] What does the average developer workflow inside fal look like today? [36:35] They are using many different models, first of all. So we looked this up recently. Our top 100 customers are using 14 different models at the same time. These are sometimes chained to each other. So one text-to-image model, one upscaler, one image-to-video model, all part of the same workflow, or a more complicated combination of these as part of the same workflow.
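Editor's sketch: the chained workflow described above (one text-to-image model, one upscaler, one image-to-video model) can be pictured as a list of swappable stages. This is a minimal illustration only. The stage names, the `Artifact` type, and the `make_stage` and `run_chain` helpers are hypothetical and are not fal's API; each stub stage just records that it ran, where a real stage would call a hosted model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Artifact:
    """Hypothetical stand-in for the media passed between stages."""
    kind: str                      # "text", "image", or "video"
    description: str               # trace of how the artifact was produced
    history: List[str] = field(default_factory=list)


Stage = Callable[[Artifact], Artifact]


def make_stage(name: str, accepts: str, produces: str) -> Stage:
    """Build a stub stage; a real stage would call a model endpoint instead."""
    def run(item: Artifact) -> Artifact:
        if item.kind != accepts:
            raise ValueError(f"{name} expects {accepts}, got {item.kind}")
        return Artifact(produces, f"{name}({item.description})", item.history + [name])
    return run


def run_chain(stages: List[Stage], prompt: str) -> Artifact:
    """Feed a text prompt through each stage in order."""
    item = Artifact("text", prompt)
    for stage in stages:
        item = stage(item)
    return item


# Hypothetical stages mirroring the chain mentioned in the conversation.
text_to_image = make_stage("text_to_image", "text", "image")
upscaler = make_stage("upscaler", "image", "image")
image_to_video = make_stage("image_to_video", "image", "video")

result = run_chain([text_to_image, upscaler, image_to_video], "a red kite over dunes")
print(result.kind, result.history)
```

Because every stage shares one signature, replacing a single model (for example, a newly released text-to-image model) means changing one list entry, which fits the point that customers mix many models in one workflow.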
[37:05] I think that's the most interesting part, the variety of the models people use on the platform. We do have a no-code workflow builder as well. We built this in collaboration with Shopify. And this is usually very good for their PMs, their marketing teams, the non-technical members of the team who are playing with these models. It's really good for trying different things, really good for comparing different models. [37:35] This makes it into the product as well. You can reach this workflow through an API. It's been [37:41] very popular recently, and more and more people in a typical software engineering organization are now interested in image and video models, so the number of users of this platform has been increasing. [37:53] Okay, so the average workflow is not just a text prompt. It's not "create a five-minute commercial" and that does it. If I wanted to create a five-minute commercial, what would the workflow be? [38:03] Yeah, so [38:04] for this reason, people actually prefer open... that's one of the reasons why people prefer open source models, because they get to have more control over the model and they can add things here and there to steer the model towards the outputs they want. [38:34] And then these workflows are usually like the ones, if you've seen any big ComfyUI workflows with many different nodes, they resemble those, where each different piece of the model can be replaced to create more control for the creator.
[38:50] Got it. Yeah. And I think our workflow tool, it's not the final form. There's almost like another layer of abstraction, maybe, on top in terms of workflow. And as we talk to these studios, we actually figure out, like, there's so many ways of using Photoshop. There's no single workflow. [39:13] There's probably, like, [39:16] based on your role, right? Like, you're a marketing person or you're an animator or whatever, you have different workflows, right? And so I think that is also emerging. As more and more professionals are actually starting to use these tools, you see the emergence of [39:32] very particular workflows, right? One of our favorite creators is PJ Ace. He actually shares his workflows online. And every time he posts things, every month, he actually has a different kind of workflow. It's really driven by the new models. Like, [39:53] based on a new model, he may have a completely new workflow next time. I think once [39:59] we sort of reach some sort of, [40:02] I guess, some sort of productivity, and, you know, some professionals actually adopting these tools, there will probably be more standardized best practices around using these abstractions. But, you know, I don't think anyone knows the final form yet. And every day we see new things, and we try to update our product to make sure it caters to those people.
[40:32] for high level what you want and you type that in and then [40:35] and the aesthetics that you want and you iterate on the aesthetics from an image model. [40:39] And then use that image model with the aesthetics you want to then generate a series of images, which then form the storyboard, so to speak. And then it cascades down. Exactly. And then the video models kind of interpolate [40:51] in between them. And it's funny because that's actually how, you know, Pixar and all these companies work, right, in terms of storyboards. And so I think it was a cost thing in the beginning. Yeah, like, that's why they had to do it like that. But it actually also makes sense, right? It makes sense in so many ways to do it like that. And yeah, they call that [41:11] stuff pre-production, and then, you know, post or production, right? So pre-production is all the tooling around storyboarding, etc. [41:19] That's what everyone does, even today. Even though it was very much a cost thing, now it's more of a speed thing. And AI makes the workflow, you know, very interesting, where you have everything laid out, and let's say a new model, a new text-to-image model, comes out. They built it in such a way that, okay, you can press a button, and now all the different combinations are going to be generated with this other model, and then you can generate all the videos again.
[41:49] flows you want to update one thing and the whole thing is going to cost like a thousand dollars to rerun it again. But these individuals, they spend a ton of money on creator platforms. I've seen bills like half a million dollars just spent by a single individual, and maybe even more when it's a small production studio, stuff like that. So it's pretty incredible. [42:15] Wonderful. Okay. Speaking of studios who are building on your platform, let's go to our final layer up the stack. Let's talk about customers and markets and then what the future might hold. Maybe, what are the coolest things that people are building on your platform today? And are they what we would think of as traditional media businesses or are they net new businesses? It's all over the place. What's so exciting about this space is that it just goes across [42:43] all of the [42:44] markets you can possibly imagine. I'll give you some more, [42:49] I guess, long-tail stuff first, because it's super fun and interesting. There's a security company that's building on top of fal, and they basically have these trainings, and the trainings are generated on the fly. And the content is all dynamic. Obviously, they have some scripts, I'm guessing, to kind of fit the curriculum, but the content you get, [43:12] you know, per person, is all dynamic. It's Brian Long's company. Yeah. This is Adaptive Security. Yeah, they do some really cool stuff. I think that's one of the most unique use cases. You can see how that translates into the rest of education. I think that market is kind of picking up. Another one, I think,
[43:35] this is a more common use case, I guess, is AI-native studios. You mentioned the Bible app. That was one of my favorites. It's called Faith. It's one of the highest ranked apps on the App Store. And yeah, they have stories for each of the stories from the Bible. And they're really well produced. And [43:58] this sort of category of AI-native studios, either in the form of applications or they're doing feature projects, [44:08] feature films and, you know, series and things like that, that's a huge category. So I would call this maybe new media, or AI-native media and entertainment. There is also a lot of design and productivity. Like, out of our public [44:26] customers, Canva is one of those, Adobe is one of those. So they're integrating, you know, in this older tooling, they're integrating new models. Ads is a big one. And ads kind of come in many flavors. Basically, there's the UGC-style ads, like the stuff you see where there's a person, you know, demoing a product. That's a very big category. [44:56] older people [44:57] styles of ads, right? More professional looking, higher production. Maybe you saw the Coca-Cola ad that came out recently. That's a controversy. Yeah. So that's kind of a higher production
[45:10] you know, [45:11] style of ads. But, you know, what we're excited about is also programmatic ads, right? Where you can do personalized, you know, to the degree of literally individuals, you know, yourself being the ad or in the movies, whatever. So yeah, that's also a big growing use case. Yeah, I'm most excited for the education use case. I think that, you know, ads is the backbone of commerce and the internet, and so, like, super compelling. [45:41] But education is a market that's so important and has never really had that many compelling business cases behind it. And part of the challenge with education, I mean, the challenge has been the bottleneck to creating high-quality content at scale that's actually ideal for the learner. And so I'm personally most excited about education. Same. Like, I really love the education use cases. [46:11] not the right form factor. Like, if you actually want to fully realize the [46:18] sort of power that these models are bringing, you actually need to go into the visual space, because then, you know, it's so much more compact. It's more approachable. [46:27] And yeah, I think once we actually crack [46:30] visual learning [46:32] through these video models, that's when it's going to really impact people. [46:37] Do you think that the advent of generative media is going to increase the value of existing IP? So like Mario Brothers, Nintendo, Disney, Pikachu, all these things? Or do you think it's going to lead to the democratization of the creation of IP?
[47:07] we thought, [47:09] all right, these AI-native studios, they're just going to take over, and Hollywood is just going to be too slow, and this is going to just go past them, and they're going to be left behind. But this summer, something changed, and we've been talking to a lot of the usual suspects from Hollywood. We recently had our first generative media conference, and Jeffrey Katzenberg, former CEO of DreamWorks, was there, and he made a comparison. [47:39] He said this is playing out exactly how animation did: when it first came out, people revolted against it. It was all hand-drawn before that, and computer graphics, it was new, and there was a lot of [47:51] rebellion against computer-driven animation. And something very similar is happening with AI right now. But there's no way of stopping technology. It's just going to happen. You're either going to be part of it or not. So we are seeing that a lot of existing IP holders are now taking this very seriously. And at least for the medium term, I think they are pretty well positioned because they have the technical people who are actually really interested behind the scenes in this technology. [48:21] They also have the IP, but they also have storytelling and filmmaking know-how. You still need [48:28] quite large budgets. Maybe things are going to get cheaper, but in the medium term, filmmaking is still going to be expensive. Yes, AI is going to make it maybe a little bit cheaper, but we need these deeply technical people who know filmmaking, who have the IP, who know storytelling, to actually, in the beginning, be part of this. And I think they're going to play a big role in the next coming years in the AI ecosystem. When there's infinite content generation, it almost puts a value on the things that are finite.
[48:58] I think for those of us who grew up with Power Rangers or Neopets or whatever, there is just this nostalgia element and this finite supply of IP that really resonates with us. [49:09] The opposite is true, too. There's a lot of new... like, we had little toys of these Italian brain rot characters. These are characters with no IP; no one owns them. They are completely AI-generated [49:21] from the internet community. And once you have [49:26] cheap generation of content, very different permutations of it, things that people like [49:32] catch on and become part of the zeitgeist. Yeah, totally. There are signs of the opposite being true as well. Yeah, both are true. Related question: how do we prevent the infinite slop machine [49:46] state of the world? You know, there's this version where we're just connected to this machine that knows how to personalize stuff for us, and we're just hooked up to the infinite slop machine. And there's a version where there's, you know, human creativity and artistry and things like that involved. How do you think the world plays out? [50:04] I think [50:06] humans eventually, [50:08] like, [50:09] converge on the things that are more meaningful in general. Like, [50:15] I don't know, no matter how much slop we fill the world with, I think, you know, taste prevails, and people are drawn to experiences that are personal and human. And, you know, I just think that that's going to happen. One interesting example of this was when Meta announced Vibes and then
[50:39] OpenAI announced Sora 2. The reception was very different, and one of the reasons, [50:45] in my mind, was that Vibes was positioned as this [50:49] slop machine kind of thing, where they didn't have the product out at the time, but it was just these AI-generated... [50:58] Like, [50:59] you have no relation to the characters, etc., right? It was kind of this detached thing, whereas [51:06] Sora really made it about friends, right? Like cameo, and, you know, they were very vocal. And now you can cameo your pets. There you go. It's huge, right? So, yeah, I think this connection [51:18] to friends and pets and things like that, that actually made... and Sora was also, they were being very personal about it. They were very adamant about, like, hey, we want to make this about friends. We want to make this about, you know, these connections, as opposed to, you know, the infinite slop machine. So I think that perception was also a good signal that there are ways to [51:44] make this technology work in a good way. [51:48] Absolutely. Okay, I'm going to get your perspective on timelines, [51:51] on what's feasible today and what's feasible to come. I guess, [51:56] do you think that we'll see Hollywood-grade, feature-film-length [52:01] films [52:02] entirely generated by AI, and if so, on what timeline? What does entirely generated by AI mean? Is it, like, no human involvement? No human filming. So, human involvement. But editing is okay. Editing, yes, absolutely human editing, but no human filming. I think in less than a year we'll have, you know, advanced video models combined with the storyboarding that people have been doing. You'll have feature-grade short films,
[52:24] like, less than [52:26] 20 minutes. I think that's [52:27] a fair estimation. Even today, I think you can do really great films. It's just that not enough investment [52:34] of time is going into these. But with enough investment of time, the model quality will be there. [52:39] I think we're right there. Okay. And you think it's photorealistic? You think it's anime? What categories do you think are more likely to happen sooner? I think photorealistic is what everyone is targeting. But anime would be a cool one, right? You don't see that many anime-specialized models. Why not? I think [52:58] there needs to be a market for that, clearly. I think it's going to be animation or anime or cartoon, not photorealistic, like, [53:09] as far away from photorealistic as possible, maybe even as fantasy as possible, because filming photorealism is [53:17] cheap and doable already. That's not what costs money when people are making movies. It's the non-photorealistic stuff that's actually expensive. [53:28] And [53:29] even if you look at the animated movies, some of my favorite movies are animated: the Toy Story series, How to Train Your Dragon, Shrek, Ratatouille. And people like these things not because [53:42] they remind them of photorealism. It's the storytelling that matters, and this created a new medium. I think AI is going to be similar to animation and how that brought a whole different angle to filmmaking.
[53:56] I think feature films are hard because with photorealism, people usually like the movies that their favorite actors are in, whatever, actors, actresses. So it's like one step removed from... That's the thing that costs money, to get the actors. Exactly. We first need to build a connection to this AI-generated character before we can turn it into a film. Yeah. [54:25] But I think, like, yeah, I think it's [54:29] among different kinds of content, like shorts, you know, I think Italian brain rot is an amazing example, right? It was first these characters, and then it became a Roblox game, and it's making, I don't even know, like, [54:43] you know, a lot of revenue. So yeah, I think AI-native stuff in shorter [54:49] form content is probably going to be very, very big. [54:53] We saw this with VFX, where VFX, one of the most expensive parts of producing these videos or films, got AI-fied very, very quickly, because it's very easy for AI to do, like, explosions, right? Or a building collapse. It's almost perfect now. And I think it's just going to continue along that dimension. And maybe facial expressions are going to be hard. And very hard. You don't have to do facial expressions. That's going to be OK. But now they can do gymnastics. Yeah. [55:23] Gymnastics are important. Good thing we have a lot of footage of Olympics.
[55:30] What about, you mentioned Roblox. At what point do you think we'll have interactive video games that are generated in real time? [55:37] Yes, I think so. I'm very excited about it, actually. I think, [55:43] like, in one world, [55:44] I think it's the [55:46] sort of next reasonable step for text-to-video. If you think text-to-video is the continuation of text-to-image, I would say text-to-game is the continuation of text-to-video. Because, you know, with a game, [56:02] you would essentially be making the video interactive, right? That's kind of what that means. And I actually think that [56:09] there is a world where these [56:12] hyper-casual games exist, but this is another level of hyper-casual where it's actually discardable. I think we're not too far away from that. I actually feel [56:24] pretty bullish on having these, you know, one-time playable games, like very short games. I think that's probably going to happen. I think that's a good use case for world models, other than any other great use cases. But I think it's going to happen. [56:41] What about AAA quality games? Will these models at least assist and change the development pipeline of those games? Yeah, I think they're already impacting. At least LLMs are impacting conversations. There's dynamic conversations, things like that. [56:56] I think [56:57] pre-production stuff is impacted already.
[57:02] I think, like, [57:03] kind of side-quest IP stuff is [57:06] impacted, right? Like, where you have the assets and you can make a minigame. I think people are using it, actually, not very publicly, but that is already happening. I think [57:15] using it for AAA production, or generating that with a model, that's like, [57:21] I don't know, at least three, four years ahead for me. And yeah, I mean, that would be insane if we can actually do that. But, you know, along the way there, just like [57:34] the video space, I think along the way to AAA, there are many other things. And I think those are going to be very big. Yeah. [57:43] The video model space has just exploded in terms of options, quality, etc. As you look ahead towards what's needed to get us to the promised land for everything that generative media can be, do you think that there are future R&D breakthroughs that are needed on the horizon, like fundamental R&D breakthroughs, or do you think we're very much in the engineering scale-up leg of the race? [58:07] I think the architecture needs to slightly change, at least if you think about scaling these models by 10x, 100x. I think the architecture is a big [58:15] bottleneck right now in terms of inference efficiency, right? Like, more compression of the video space, [58:20] that's definitely needed. We saw this with image. Image models used to be much less compressed. You were operating in pixel space, and then we introduced latent space. And then even inside that latent space, you took 64 pixels and made them a single pixel. And now with video, we are compressing on the time dimension, where we are seeing 4x ratios. Why not 24x or whatever? You need to increase that compression. And I think that's going to be a big driver of improving both inference efficiency
[58:50] efficiency. But I think, like, [58:53] any model, I think at this stage that we are at, any model you take [58:56] on the generative media side, we're far from being scaled up engineering-wise. I think there's not enough investment being put in, or it just started happening within the past six months. Google showed this with their models and how quickly they were able to catch up. They didn't need to innovate that much. It's just, [59:13] they have the resources, they can put more effort into it. But at the same time, smaller labs are able to demonstrate this because there's so much unique and novel stuff that you can do at the data level to train these models. [59:25] So I think that's also helping, contributing. And there's the factor of, you know, mid-tier labs that raise like a hundred to a billion dollars that are also trying to come up with models, releasing them open source or contributing to the ecosystem. Yeah. [59:38] That's what's so exciting about this space. There's so much more work to do. So far, the research community did the simplest thing possible. They captioned images and trained the model on text-to-text prompt. And now we are doing video, image editing, that requires a lot more data engineering to create the data sets. But luckily, seemingly, we have a lot of abundant free video data. [1:00:08] a lot more work to do and a lot more room for improvement. I mean, earlier on, Gorkem's math also indicates that, like, if you want to get to 4K video in real time, that is like,
[1:00:20] I mean, that means, I don't know, 100x, maybe more, [1:00:24] in [1:00:26] compute or architecture. Something has to give to get us there, right? [1:00:33] And yeah, right now, a lot of models are [1:00:36] not that usable, for professionals especially, right? Or even for consumers, right? If you're sitting there, for the best models, you still have to wait like 40 seconds. I don't know. Sometimes you have to wait [1:00:51] two minutes, three minutes. That's not really acceptable in a world where we want everything [1:00:57] on demand. So yeah, I think something needs to change. And probably, [1:01:04] like, [1:01:04] hardware [1:01:06] getting faster is not enough. I think if that's the case, it'll take much longer. We'll have longer timelines. So I think architecture needs to get better. Awesome. Thank you, guys. You made a very high-conviction bet on generative media as a theme, I think, way before it was obvious. I think we are just at the start of what's going to be an explosion of generative media. And it's been really cool to hear about everything you've built, from the kernel optimizations and the compiler [1:01:36] all the way up to the workflows and what you're seeing from customers with new and old media alike. And so thank you for joining us on the show today. Thank you. Thank you so much. Thanks for having me. This was a lot of fun. [1:01:47] Music.
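Editor's sketch: the compression figures mentioned around [58:20] (64 pixels collapsed into one latent position, and 4x versus a speculative 24x compression along the time axis) can be turned into rough arithmetic. The clip size below (1280x720, 5 seconds at 24 fps) is an assumption chosen for illustration, and reading "64 pixels into one" as an 8x8 spatial patch is also an assumption. The function name is hypothetical. This counts latent positions only; it does not estimate actual model cost.

```python
import math


def latent_positions(width, height, frames, spatial_factor, temporal_factor):
    """Count latent positions after compressing each axis by the given factor."""
    return (math.ceil(width / spatial_factor)
            * math.ceil(height / spatial_factor)
            * math.ceil(frames / temporal_factor))


# Assumed clip for illustration: 1280x720, 5 seconds at 24 fps = 120 frames.
WIDTH, HEIGHT, FRAMES = 1280, 720, 120

pixels = WIDTH * HEIGHT * FRAMES

# 8x8 spatial patches (64 pixels -> 1 position) with 4x temporal compression.
at_4x = latent_positions(WIDTH, HEIGHT, FRAMES, spatial_factor=8, temporal_factor=4)

# Same spatial factor with the speculative 24x temporal compression.
at_24x = latent_positions(WIDTH, HEIGHT, FRAMES, spatial_factor=8, temporal_factor=24)

print("raw pixel positions:", pixels)
print("latent positions at 4x temporal:", at_4x)
print("latent positions at 24x temporal:", at_24x)
print("reduction from 4x to 24x:", at_4x / at_24x)
```

Under these assumptions, moving from 4x to 24x temporal compression cuts the number of positions the model has to process by a factor of six, which is the kind of saving the speakers are pointing at when they ask for more compression.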