293: Local AI

Arvid:

Welcome to The Bootstrapped Founder. I have a not-yet-so-hyped AI topic for you today. We live in a day and age where using AI on our own servers has become possible, and I think that is magnificent. I will explain to you why. As a business owner and a software entrepreneur, I think it is just as interesting as it is important to consider setting up these systems on our own back ends instead of relying on hosted platforms and APIs.

Arvid:

That's just platform risk, right? And up until now, AI has been one of those things that you only get on somebody else's platform. If you're a software founder interested in using AI technology without depending on someone else's unit economics, this is for you today. We'll dive into running your own ChatGPT replacement for fun and, more importantly I guess, for profit.

Arvid:

I've been tinkering with this kind of tech over the last couple weeks to great success. So in the spirit of building in public, why not tell you what I did and how. In my latest SaaS, PodScan, I use two kinds of artificial intelligence: a large language model, like ChatGPT, that generates responses based on prompts, and a transcription system. The transcription system is pretty cool, but it's not something that every software entrepreneur needs, because it's really specific to converting audio into text.

Arvid:

And audio is a niche medium for most of us. But I would say all software founders work with text data in some shape or form, somewhere in a database, right? Customer records, notes, instructions, it's all text. And founders got very excited back in the day when ChatGPT came out, which now I guess is two years ago, which in terms of our industry is really just a blip, right?

Arvid:

But still, people got super excited. All of a sudden, particularly once we could access that kind of service through an API, we could build on top of these amazingly smart language models. And that pioneering spirit has brought us to an interesting inflection point. Because the "open" in OpenAI, the company that spearheaded ChatGPT and all the GPT systems, has been a catalyst for the open source community as well. It turns out that the most exciting development in recent years is not just the existence of the GPT models and ChatGPT, but the fact that many universities, research groups, and companies have open sourced their code for training and running these models.

Arvid:

And when the nerds start building stuff together in the open, working on public data, building in public, for free and without restrictions, interesting things start to happen. And one of them is called llama.cpp. It is a completely open source, cross-platform framework that allows us to train and run our own AI models on consumer hardware that we own: our computers, our laptops, our desktop systems, our servers if we want to. We don't need the massive GPUs and amounts of RAM that the big guys have, but we still get to run tech that's almost as good.

Arvid:

Obviously, if you have GPUs with hundreds of gigabytes of RAM, you can run larger models, which is what GPT is, right? Those things have tens, if not hundreds, of billions of parameters that they're trained on. The models that run on consumer hardware have 7 or maybe 13 billion parameters, which is still a lot of parameters, but, you know, there is a difference there. Still, almost as good. And in most cases, for entrepreneurs, particularly when you're starting out building prototypes and such, it's good enough. The big benefit here is that we can avoid the costs and dependencies associated with using hosted platforms like OpenAI's API.

Arvid:

And in addition to that, or maybe because of it, we gain more control and flexibility over AI applications for our own businesses. So the risk goes down and control goes up. And that kind of is the indie founder's dream, right? You derisk and you get more control. I think that's just wonderful.

Arvid:

Let me share an example from just this week that I have found to work really well with my own business. PodScan was, until Wednesday, a keyword alerting tool for podcasts. You would write down or configure a list of words in my user interface, and PodScan would transcribe every newly released podcast out there as quickly as it could, then compare your list against the transcript and alert you if there was a match. Right? Just regular keyword matching.
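
For reference, a toy version of that original keyword check might look something like this; the function and variable names are illustrative, not PodScan's actual code:

```python
def keyword_matches(transcript: str, keywords: list[str]) -> list[str]:
    """Return the configured keywords that appear in a transcript."""
    text = transcript.lower()
    return [keyword for keyword in keywords if keyword.lower() in text]

# hypothetical usage: alert the user only when at least one keyword matches
# hits = keyword_matches(transcript, ["bootstrapping", "founder sales", "churn"])
# if hits:
#     send_alert(episode, hits)  # send_alert is assumed, not real PodScan code
```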

Arvid:

The magic in this product really is that it ingests all podcasts everywhere. But it was a fairly simple kind of keyword check. So far so good. And this is already creating massive value in the world of podcast discovery, which is severely underserved, and it's kinda hard because audio is hard to introspect. But now consider this from a user's perspective.

Arvid:

What if you don't know the keywords beforehand? What if you wanna be alerted about something that is more nebulous? Something like all podcasts where people talk about community events that are organized by women, or all podcasts where people really nerd out about their favorite TV show. You can't just look for "TV show" as a keyword. That's just gonna give you everybody talking about every TV show in any capacity.

Arvid:

And even if you wanted to, you couldn't come up with all the specific keywords that would allow you to reliably match every podcast that falls into these categories either. But what if you could ask a simple question of each show? Questions like: does this episode have nerds talking about sci-fi in an excited way? And that's what local AI allowed me to build in like a day and a half, which is crazy.

Arvid:

With the help of llama.cpp and a large language model called Mistral 7B, the 7B should not be surprising, I just mentioned that some of these models have 7 billion parameters, I set up a back end service that takes a transcript that the other AI in my system creates, takes a question, and then spits out either a yes or a no. A question like: is this a show hosted by nerds talking about sci-fi?

Arvid:

If the answer is yes, then you get an alert, but only then. And this system takes any transcript and any question and answers that question on that transcript in under a second per combination. And most importantly, it does this on the same hardware that my transcription servers are already running on. They're just gobbling up all these new podcast episodes and transcribing them, and in between transcriptions, I sometimes just do some inference, so I ask this question of the transcript. I don't have to count API calls.
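
Here is a minimal sketch of what such a yes/no check could look like, assuming llama.cpp's built-in server is already running locally with a Mistral 7B model loaded. The port, the /completion endpoint, and the "content" response field follow llama.cpp's server as I understand it, and the prompt wording is illustrative, not the actual PodScan prompt:

```python
import requests

LLAMA_SERVER = "http://127.0.0.1:8080/completion"  # assumed local llama.cpp server

def answers_yes(transcript: str, question: str) -> bool:
    """Ask a yes/no question about a transcript via a local llama.cpp server."""
    prompt = (
        "Read the podcast transcript and answer the question "
        "with a single word, yes or no.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Question: {question}\nAnswer:"
    )
    response = requests.post(
        LLAMA_SERVER,
        json={"prompt": prompt, "n_predict": 3, "temperature": 0},
        timeout=60,
    )
    answer = response.json().get("content", "").strip().lower()
    return answer.startswith("yes")

# hypothetical usage: only alert when the model says yes
# if answers_yes(transcript, "Is this a show hosted by nerds talking about sci-fi?"):
#     send_alert(episode)
```

In practice you would also have to fit the transcript into the model's context window, for example by chunking or summarizing it first.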

Arvid:

I just have to have a computer with a GPU with like 8 gigabytes of RAM, maybe a little bit more, but that is enough. Cloud hosting for GPUs right now is still super expensive; you pay around $500 per month for a single server with a GPU. But even a Mac mini, which is like $800, can run this kind of AI inference at one question per transcript per second. And it's quite literally both what I'm recording this on right now and where the inference is still running in the background.

Arvid:

Like, I'm recording a podcast episode on the computer that is inferring and transcribing as I speak. That's the power of GPU-based stuff, right? The GPU is not needed for audio recording, so it can just keep pulling in data and working through audio while I speak. It's incredible.

Arvid:

And platforms like OpenAI and other competitors have started to compete on price. They understand that people have started to run this locally too. OpenAI's API is a big mover here towards lower prices. It's really, really affordable to use GPT-3.5. This is kind of the budget version of GPT, because the current, most recent version is GPT-4, and it's super powerful, has a much more recent knowledge cutoff, and it's more performant at certain things.

Arvid:

And GPT-3.5 is kind of what we had when we started out, ish, right? It's good for any task that you need to scale. And you can get millions of tokens, those are roughly words or word fragments, for under a dollar. That's quite impressive, and let's be reasonable, I guess it fits most budgets if you just have to work through some text data some of the time. But it doesn't fit all budgets.

Arvid:

If you deal with lots of data and you need to run prompts on that data constantly, like if you were analyzing every English-speaking podcast out there at all times, GPT-3.5 can cost tens, if not hundreds, of dollars per day. GPT-4 would easily go into the thousands if you were running it constantly. That's not scalable for a small business, but running your own local LLM? That sure is. Having AI on your own server was impossible for a long time.
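
To put a rough number on that, here is the kind of back-of-the-envelope math involved; every figure below is an illustrative assumption, not actual OpenAI pricing or PodScan volume:

```python
# all values are assumptions for illustration only
price_per_million_tokens = 1.00   # roughly "millions of tokens for under a dollar"
tokens_per_transcript = 10_000    # assume an average episode transcript
episodes_per_day = 20_000         # assume you analyze every new English episode

daily_tokens = tokens_per_transcript * episodes_per_day           # 200 million tokens
daily_cost = daily_tokens / 1_000_000 * price_per_million_tokens  # ~200 dollars
print(f"~${daily_cost:,.0f} per day")
```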

Arvid:

And the requirement of a current GPU still makes it somewhat expensive, but fortunately, AI comes in two forms: inference and training. Inference is kind of applying a prompt and getting a reply, and training is setting up the model. And inference itself is surprisingly possible on regular old CPUs as well. Traditionally, over the last few years, all kinds of machine learning and AI work has been done on the GPU, because GPUs are designed for massive parallel computation; they're graphics processors, they are supposed to create these really immersive game worlds.

Arvid:

And recent GPUs have added tensor cores, which handle specific mathematical operations that are also used in games and machine learning and all kinds of other computation-heavy things. So your computer's boring old CPU handles regular, more straightforward linear computations, and the GPU is much faster for certain parallel tasks. But in recent times, LLMs that used to require a GPU now run hilariously fast on a CPU as well, mostly because those CPUs have lots of cores. I recently rented a cloud server somewhere; it has a GPU attached, which is the reason why I rented it, but it also has 32 CPU cores that I can use for CPU computation.

Arvid:

It's crazy, right? The number of cores that you can do parallel processing on has just shot up magnificently, so you can now do inference on those systems as well. You can run models, 7B, 13B models, even on a computer without a graphics card. And that's what most servers are, right?

Arvid:

They are computers without GPUs, but with quite some RAM and a lot of CPU cores in there as well. And this change has led to the growth of an open source community creating local large language model tools that can be used on both kinds of chips. And that's the ".cpp" in llama.cpp, right? It's C++. It signifies that these tools were meant for CPU-based inference.

Arvid:

And whisper.cpp, the speech-to-text sibling project, is also meant to run on the CPU. And the open source nature of these projects has been supported by an unlikely ally: Meta, the company behind Facebook. The company we all love for their privacy intrusions and that kind of stuff. They released an open source large language model called LLaMA in 2023.

Arvid:

And that's pretty big for them, to go into open source that much. I mean, they've done this with React and other projects, but, you know, it is pretty significant that OpenAI's competitor, and that would be Meta, releases their models and all that for free. Because OpenAI's GPT models are proprietary; the company has published only research papers and no code. That inspired people to build their own models using public data.

Arvid:

There's even a benchmarking system to compare these self-trained models, like Mistral 7B and many, many others, with the proprietary models like GPT-3 or GPT-4. And some of these new models come pretty close. Independent companies and open source communities release new large language models daily now that perform better than GPT-3.5. Maybe not as scalable, they don't have the infrastructure that OpenAI has with their whole server cluster, but the model itself performs better, more accurately and more reliably, sometimes faster. And these things perform almost as well as GPT-4, which is kind of the gold standard of these models right now in terms of speed and accuracy.

Arvid:

These open source models are available to everyone. That includes me and that includes you. And they help advance the field of AI language processing quite significantly. Because the moment we start using these things, we find the flaws. We report this on GitHub and all these pages that exist where these models are hosted.

Arvid:

And you will find those models on huggingface.co, which is an interesting name for an interesting company, but it's a website where you can download these open source language models in various formats. Whatever format you need, whatever tool you use, you will find the model of your choice in that format there. I recommend following a person called Tom Jobbins there; their nickname is TheBloke. Tom's models come in all kinds of formats, and they've been reliably good to me.

Arvid:

I've been using models from that person's page there quite a lot. People like Tom don't just upload the model. They share the source material and all the files needed to run it on your own computer, which is really cool. That's really all there is to it.

Arvid:

You download llama.cpp from GitHub and you compile it; it's a cross-platform thing, so it works on Windows, on Linux, on Mac, doesn't really matter. You download a model, which is just one download from huggingface.co. I don't think you even need to be logged in there to download it. And then you're done.
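
As a sketch, that one download can even be scripted with the official huggingface_hub library; the repo and file names below follow the format TheBloke typically uses, but treat them as placeholders and check the model page for the current ones:

```python
from huggingface_hub import hf_hub_download

# assumed repo and quantized file name; verify on huggingface.co before using
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
)
print(model_path)  # local file path you can point llama.cpp at with its -m flag
```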

Arvid:

llama.cpp lets you run a command on your computer or start a little server that loads a large language model into your graphics card's memory, or your regular memory. And then it allows you to do local inference through an API: you send an HTTP request, or you just point it at a JSON file where you have some data, and it's crazy. It even comes with an example page; it's very simple to use. And even if you're not into AI and don't see an immediate use for it, I highly recommend looking into this.
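
If you'd rather stay inside a script than talk HTTP, the llama-cpp-python bindings wrap the same engine and load the model in-process; a minimal sketch, with the model path and parameters as assumptions:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # assumed local path
    n_ctx=4096,       # context window for prompt plus answer
    n_gpu_layers=-1,  # offload layers to the GPU if one is available, else run on CPU
)

out = llm("Q: Can a Mac mini run a 7B model? A:", max_tokens=32, temperature=0)
print(out["choices"][0]["text"].strip())
```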

Arvid:

llama.cpp automatically detects the best capacity your computer has for running inference: it checks if there's a GPU available, if you have the right drivers, and if you have the toolkits installed that you need, and then it uses these to maximize efficiency. And the community on the GitHub page is really, really good too, so you will find a lot of help if you wanna set this up. You'll have an API on your own personal computer that is, I guess, as fast and performant as if you were using a hosted API, and that's quite magical. And it's yours.

Arvid:

That's the thing, right? I think this is the year where software entrepreneurs learn how to wrangle control back onto their own systems. And AI is one of the fields where this happens. It's gonna be a wild ride for sure, and things change every day, but there's something incredibly powerful about knowing that OpenAI can implode tomorrow and shut down the API, but my local installation of Mistral 7B, and a couple of cloud computers that have it running too, will still be mine to command.

Arvid:

They don't belong to anybody but me. I mean, it's the cloud, so it's somebody else's computer, but you know what I mean, right? The installation, the software itself, something that I could just clone as an image and run somewhere else. That is mine.

Arvid:

Local AI is powerful, and it's here to stay. And that's it for today. I wanna briefly thank my sponsor, acquire.com. Because one thing that stood out to me this week as well, in reflecting on this, is how much more sellable having ownership over your back end magic, like AI, can make your business. An acquirer always buys your liabilities as well.

Arvid:

Right? That's what they buy. They buy the good and the bad; they buy the whole business. But if you have these powerful AI systems in house, on your servers, those liabilities are kind of reduced to keeping the servers running. And frankly, that is what they're used to anyway; that's how acquirers work.

Arvid:

They know that software businesses need servers to keep running, right? So for them, the AI on the back end is a bonus, and local AI therefore is kind of an acquisition price boost for when you eventually sell your software business. And I guess there are many reasons to sell your business. Maybe you're done with the work, you built what you wanted to build, you wanna do something else, you've reached a skill ceiling, you want a different lifestyle, whatever it is. No matter why you might want a change in your life, selling your business at that point is probably a very impactful and positive experience, especially if you sell your SaaS on acquire.com.

Arvid:

The folks over there have helped thousands of people sell their valuable businesses for life-changing amounts of money, and the thing is, you don't have to sell today. But if you're ever interested in pivoting into financial security, you probably want to check out acquire.com. Go to try.acquire.com/arvid and see if this is for you and your business right now. It's always good to plan ahead.

Arvid:

Thank you for listening to The Bootstrapped Founder today. You can find me on Twitter at arvidkahl, that's A-R-V-I-D-K-A-H-L, and you'll find my books and my Twitter course there too. If you wanna support me and this show, please subscribe to my YouTube channel, get the podcast in your player of choice, and leave a five-star rating and a review by going to ratethispodcast.com/founder. That makes a massive difference if you show up there, because then the podcast will show up in other people's feeds, and that will help the show. Thank you so much for listening.

Arvid:

Have a wonderful day, and bye bye.
