[Jeremy Howard]
Introduction to Language Models
Hi, I am Jeremy Howard from fast.ai, and this is a hacker’s guide to language models. When I say a hacker’s guide, what we’re going to be looking at is a code-first approach to understanding how to use language models in practice. So before we get started, we should probably talk about what is a language model. I would say that this is going to make more sense if you know the kind of basics of deep learning. If you don’t, I think you’ll still get plenty out of it, and there’ll be plenty of things you can do. But if you do have a chance, I would recommend checking out course.fast.ai, which is a free course. And specifically, if you could at least kind of watch, if not work through the first five lessons, that would get you to a point where you understand all the basic fundamentals of deep learning that will make this lesson tutorial make even more sense.
[00:01:09]
Maybe I shouldn't call this a tutorial; it's more of a quick run-through. So I'm going to try to run through all the basic ideas of language models, how to use them, both open source ones and OpenAI-based ones. And it's all going to be based on code as much as possible. So let's start by talking about what a language model is.
What is a Language Model?
And so as you might have heard before, a language model is something that knows how to predict the next word of a sentence, or knows how to fill in the missing words of a text. And we can play with it by passing in some words and ask it to predict what the next words might be. So if we pass in, when I arrived back at the panda breeding facility after the extraordinary rain of live frogs, I couldn’t believe what I saw.
[00:02:05]
I just came up with that yesterday and I thought, what might happen next? So it's kind of fun for creative brainstorming. There's a nice site called nat.dev. Nat.dev lets us play with a variety of language models. And here I've selected text-davinci-003, and I'll hit submit and it starts printing stuff out. The pandas were happily playing and eating the frogs that had fallen from the sky. It was an amazing sight to see these animals taking advantage of such a unique opportunity. The staff took quick measures to ensure the safety of the pandas and the frogs. So there you go, that's what happened after the extraordinary rain of live frogs at the panda breeding facility. You'll see here that I've enabled show probabilities, which is a thing in nat.dev where it shows, well, let's take a look. It's pretty likely the next word here is going to be the. And after the, since we're talking about a panda breeding facility, it's going to be pandas were.
[00:03:02]
And what were they doing? Well, they could have been doing a few things. They could have been doing something happily, or the pandas were having, the pandas were out, the pandas were playing. So it picked the most likely: it thought it was 20% likely it's going to be happily. And what were they happily doing? Could have been playing, hopping, eating, and so forth. So they're eating the frogs that, and then had almost certainly. So you can see what it's doing at each point is predicting the probability of a variety of possible next words. And depending on how you set it up, it will either pick the most likely one every time, or you can muck around with things like top-p values and temperature to change what comes up. And each time it'll then give us a different result. And this is kind of fun.
[00:04:03]
Frogs perched on the heads of some of the pandas. It was an amazing sight, etc, etc. Okay, so that's what a language model does. Now you might notice here it hasn't predicted pandas in one go; it's predicted pand, and then separately, as. So it's not always a whole word. Here it's un- and then harmed. Oh, actually it's un-har-med. So you can see that it's not always predicting words. Specifically, what it's doing is predicting tokens. Tokens are either whole words or sub-word units, pieces of a word, or they could even be punctuation or numbers or so forth.
[00:05:00]
So let's have a look at how that works. So for example, we can use the actual process, it's called tokenization, that creates tokens from a string. We can use the same tokenizer that GPT uses via the tiktoken library. And we can specifically say we want the same tokenizer that that model, text-davinci-003, uses. And so, for example, when I tried this earlier, it talked about the frogs splashing. And so I thought, we'll encode "they are splashing". And the result is a bunch of numbers. And what those numbers are, they're basically just lookups into a vocabulary that OpenAI, in this case, created. And if you train your own models, your code will create its own vocabulary. And if I then decode those, it says, oh, these numbers are: they, space-are, space-spl, ashing. And so put that all together: they are splashing.
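To make that concrete, here's a minimal sketch using the tiktoken library; the exact ids and pieces you get depend on the encoding:

```python
import tiktoken

# Get the tokenizer ("encoding") that text-davinci-003 uses
enc = tiktoken.encoding_for_model("text-davinci-003")

toks = enc.encode("They are splashing")
print(toks)                                              # a short list of token ids
print([enc.decode_single_token_bytes(t) for t in toks])  # the sub-word pieces, e.g. b'They', b' are', b' spl', b'ashing'
print(enc.decode(toks))                                  # back to the original string
```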
[00:06:00]
So you can see that the start of a word, with a space before it, is also being encoded as part of the token. It's quite neat that these language models work at all. But they're not of themselves really designed to do anything.
The ULMFiT Algorithm
Let me explain. So the basic idea of what ChatGPT, GPT-4, Bard, etc. are doing comes from an algorithm I created back in 2017 called ULMFiT. Sebastian Ruder and I wrote a paper describing the ULMFiT approach, which was the one that basically laid out what everybody's doing, how this system works. And the system has three steps. Step one is language model training.
[00:07:02]
You'll see this is actually from the paper; we described it there as pre-training. Now, what language model pre-training does is this: it's the thing which predicts the next word of a sentence. And so in the original ULMFiT paper (the algorithm I developed in 2017, which Sebastian Ruder and I then wrote up in early 2018), what I originally did was train this language model on Wikipedia. Now, what that meant is I took a neural network, and a neural network is just a mathematical function that's extremely flexible and has lots and lots of parameters. Initially it can't do anything, but using stochastic gradient descent, or SGD, you can teach it to do almost anything if you give it examples. And so I gave it lots of examples of sentences from Wikipedia. So for example, from the Wikipedia article for The Birds: The Birds is a 1963 American natural horror thriller film produced and directed by Alfred, and then it would stop.
[00:08:06]
And so then the model would have to guess what the next word is. And if it guessed Hitchcock, it would be rewarded, and if it guessed something else, it would be penalized. Effectively, it's trying to maximize those rewards; it's trying to find a set of weights for this function that makes it more likely that it would predict Hitchcock. And then later on in this article, the Wikipedia plot summary reads: Annie previously dated Mitch but ended it due to Mitch's cold, overbearing mother Lydia, who dislikes any woman in Mitch's... Now you can see that filling this in actually requires being pretty thoughtful, because there's a bunch of things that kind of logically could go there. A woman could be in Mitch's closet, could be in Mitch's house. And you could probably guess that in the Wikipedia article describing the plot of The Birds, it's actually: any woman in Mitch's life.
[00:09:05]
Now, to do a good job of solving this problem, as well as possible, of guessing the next word of sentences, the neural network is going to have to learn a lot of stuff about the world. It’s going to learn that there are things called objects, that there’s a thing called time, that objects react to each other over time, that there are things called movies, that movies have directors, that there are people, that people have names, and so forth. And that a movie director is Alfred Hitchcock, and he directed horror films, and so on and so forth. It’s going to have to learn an extraordinary amount if it’s going to do a really good job of predicting the next word of sentences. Now, these neural networks, specifically, are deep neural networks. This is deep learning.
[00:10:00]
And these deep neural networks, when I created this, had, I think, around 100 million parameters; nowadays they have billions of parameters. They have the ability to create a rich hierarchy of abstractions and representations which they can build on. And so this is really the key idea behind neural networks and language models: if it's going to do a good job of being able to predict the next word of any sentence in any situation, it's going to have to know an awful lot about the world. It's going to have to know how to solve math questions, or figure out the next move in a chess game, or recognize poetry, and so on and so forth. Now, nobody said it's going to do a good job of that, and it's a lot of work to create and train a model that is good at that.
[00:11:02]
But if you can create one that’s good at that, it’s going to have a lot of capabilities internally that it would have to be drawing on to be able to do this effectively.
Compression and Intelligence
So the key idea here, for me, is that this is a form of compression. And this idea of the relationship between compression and intelligence goes back many, many decades. And the basic idea is that if you can guess what words are coming up next, then effectively you’re compressing all that information down into a neural network. Now, I said this is not useful of itself. Well, why do we do it? Well, we do it because we want to pull out those capabilities. And the way we pull out those capabilities is we take two more steps. The second step is we do something called language model fine-tuning. And in language model fine-tuning, we are no longer just giving it all of Wikipedia, or nowadays we don’t just give it all of Wikipedia, but in fact a large chunk of the internet is fed to pre-training these models.
[00:12:11]
In the fine-tuning stage, we feed it a set of documents a lot closer to the final task that we want the model to do. But it's still the same basic idea: it's still trying to predict the next word of a sentence. After that, we then do a final classifier fine-tuning, and in the classifier fine-tuning, this is the kind of end task we're trying to get it to do. Now, nowadays, for these two steps, very specific approaches are taken. For step two, step (b), the language model fine-tuning, people nowadays do a particular kind called instruction tuning. The idea is that the task we most often want to achieve is to solve problems and answer questions.
[00:13:01]
And so in the instruction tuning phase, we use datasets like this one. This is a great dataset called OpenOrca, created by a fantastic open source group, and it's built on top of something called the Flan Collection. And you can see that basically there's all kinds of different questions in here. So there's four gigabytes of questions and context and so forth, and each one generally has a question or an instruction or a request and then a response. Here are some examples of instructions; I think this is from the Flan dataset, if I remember correctly. So for instance, it could be: does the sentence "In the Iron Age" answer the question "The period of time from 1200 to 1000 BCE is known as what?", choices: 1) yes, 2) no. And then the model is meant to write one or two as appropriate.
[00:14:06]
Or it could be things about, I think this is from a music video: who is the girl in "More Than You Know"? And then it would have to write the correct name of the model or dancer or whatever it was from that music video, and so forth. So it's still doing language modeling; fine-tuning and pre-training are kind of the same thing, but this is more targeted now, not just to be able to fill in the missing parts of any document from the internet, but to fill in the words necessary to answer questions, to do useful things. Okay, so that's instruction tuning. And then step three, which is the classifier fine-tuning. Nowadays there are generally various approaches such as reinforcement learning from human feedback (RLHF) and others, which basically give humans, or sometimes more advanced models, multiple answers to a question, such as (here are some from an RLHF paper, I can't remember which one I got it from): list five ideas for how to regain enthusiasm for my career.
[00:15:21]
And so the model will spit out two possible answers, or there will be answers from a less good model and a better model, and then a human, or a better model, will pick which is best. And so that's used for the final fine-tuning stage. So all of that is to say: although you can download pure language models from the internet, they're not generally that useful on their own until you've fine-tuned them. Now, you don't necessarily need step C nowadays; actually, people are discovering that maybe just step B might be enough.
[00:16:00]
It’s still a bit controversial. Okay, so when we talk about a language model, we could be talking about something that’s just been pre-trained, something that’s been fine-tuned, or something that’s gone through something like RLHF. All of those things are generally described nowadays as language models. So my view is that if you are going to be good at language modeling in any way, then you need to start by being a really effective user of language models.
GPT-4: The Best Language Model
And to be a really effective user of language models, you've got to use the best one that there is. And currently (so what are we up to, September 2023?) the best one is by far GPT-4. This might change sometime in the not too distant future, but right now, GPT-4 is the recommendation, a strong, strong recommendation. Now, you can use GPT-4 by paying 20 bucks a month to OpenAI, and then you can use it a whole lot.
[00:17:05]
It's very hard to run out of credits, I find. Now, what can GPT-4 do? It's interesting and instructive, in my opinion, to start with the very common views you see on the internet, or even in academia, about what it can't do. So for example, there was this paper you might have seen, "GPT-4 Can't Reason", which describes an empirical analysis of 25 diverse reasoning problems and found that GPT-4 was not able to solve them, that it's utterly incapable of reasoning. So I always find you've got to be a bit careful about reading stuff like this, because I just took the first three problems I came across in that paper, and I gave them to GPT-4.
[00:18:00]
And by the way, something very useful in GPT-4 is you can click on the share button, and you'll get something that looks like this, which is really handy. So here's an example of something from the paper that said GPT-4 can't do this. Mabel's heart rate at 9am was 75 beats per minute. Her blood pressure at 7pm was 120 over 80. She died at 11pm. Was she alive at noon? Of course she was; she's human, so obviously she must have been. And GPT-4 says, hmm, this appears to be a riddle, not a real inquiry into medical conditions. Here's a summary of the information, and yeah, it sounds like Mabel was alive at noon. So it's correct. That was the second one I tried from the paper that says GPT-4 can't do this, and I found that actually it can. Same story with the next one: the paper said GPT-4 can't do it, and I found it can.
[00:19:01]
Now, I mention this to say that GPT-4 is probably a lot better than you would expect if you've read all this stuff on the internet about the dumb things that it does. Almost every time I see somebody on the internet saying there's something GPT-4 can't do, I check it and it turns out it can. This one was just last week. Sally, a girl, has three brothers. Each brother has two sisters. How many sisters does Sally have? So you have to think about it. And GPT-4 says, okay, Sally's counted as one sister by each of her brothers. If each brother has two sisters, that means there's another sister in the picture apart from Sally. So Sally has one sister. Correct. And then this one I got sort of three or four days ago.
[00:20:01]
This is a common view, that language models can't track things like this. Here's the riddle. I'm in my house. On top of my chair in the living room is a coffee cup. Inside the coffee cup is a thimble. Inside the thimble is a diamond. I move the chair to the bedroom. I put the coffee cup on the bed. I turn the cup upside down. Then I return it upside up. I place the coffee cup on the counter in the kitchen. And GPT-4 reasons that the diamond probably fell out, so therefore the diamond is in the bedroom, where it fell out. Again, correct. Why is it that people are claiming that GPT-4 can't do these things?
GPT-4’s Capabilities and Limitations
And it can. Well, the reason is because I think on the whole they are not aware of how GPT-4 was trained. GPT-4 was not trained at any point to give correct answers. GPT-4 was trained initially to give most likely next words, and there’s an awful lot of stuff on the internet where documents are not describing things that are true.
[00:21:12]
There could be fiction, there could be jokes, there could be mistakes. So this first stage does not necessarily give you correct answers. The second stage, the instruction tuning, is trying to give correct answers, but part of the problem is that in the later stage, where you start asking people which answer they like better, people tended to prefer more confident answers, and they often were not people who were trained well enough to recognize wrong answers. So there are lots of reasons that the SGD weight updates from this process, for stuff like GPT-4, don't particularly, or don't entirely, reward correct answers.
[00:22:06]
But you can help it give you correct answers if you think about the LM pre-training: what are the kinds of things in a document that would suggest, oh, this is going to be high quality information? And so you can actually prime GPT-4 to give you high quality information by giving it custom instructions. What this does is it's basically text that is prepended to all of your queries. And so you say things like, you're brilliant at reasoning; obviously you have to prime it to give good answers. And then you try to work against the fact that the RLHF raters preferred confidence.
[00:23:00]
Just tell it: tell me if there might not be a correct answer. Also, the way that the text is generated is it literally generates the next word, then it puts that whole lot back into the model and generates the next next word, puts that all back in the model, generates the next next next word, and so forth. That means the more words it generates, the more computation it can do. And so I literally tell it that, right? I say: first, spend a few sentences explaining background context, etc. So this custom instruction allows it to solve more challenging problems. And you can see the difference. Here's what it looks like. For example, if I say, how do I get a count of rows grouped by value in pandas? It gives me a whole lot of information first, which is actually it thinking.
[00:24:02]
So I just skip over it, and then it gives me the answer. And in my custom instructions, I also say: if the request begins with vv, make it as concise as possible. And so it kind of goes into brief mode. And here is brief mode: how do I get a count of rows grouped by value in pandas, the same question, but with vv at the start. And it just spits it out. Now, in this case, it's a really simple question, so it didn't need time to think. So hopefully that gives you a sense of how to get language models to give good answers.
Getting Good Answers from Language Models
You have to help them. And if it’s not working, it might be user error, basically. But having said that, there’s plenty of stuff that language models like GPT-4 can’t do. One thing to think carefully about is, does it know about itself? Can you ask it, what is your context length?
[00:25:02]
How were you trained? What transformer architecture are you based on? At any one of these stages, did it have the opportunity to learn any of those things? Well, obviously not at the pre-training stage: nothing on the internet existed during GPT-4's training saying how GPT-4 was trained. Probably ditto in the instruction tuning. Probably ditto in the RLHF. So in general, you can't ask a language model about itself. Now, again, because of the RLHF, it'll want to make you happy by giving you opinionated answers, so it'll just spit out the most likely thing it thinks, with great confidence. This is just a general kind of hallucination. Hallucination is just this idea that the language model wants to complete the sentence, and it wants to do it in an opinionated way that's likely to make people happy.
[00:26:02]
It doesn't know anything about URLs. It really hasn't seen many at all; I think a lot of them, if not all of them, were pretty much stripped out. So if you ask it anything about what's at some web page, again, it'll generally just make it up. And GPT-4, at least, doesn't know anything after September 2021, because the information it was pre-trained on was from that time period, September 2021 and before, called the knowledge cutoff. So here are some things it can't do. Steve Newman sent me this good example of something that it can't do. Here is a logic puzzle. I need to carry a cabbage, a goat, and a wolf across a river. I can only carry one item at a time. I can't leave the goat with the cabbage. I can't leave the cabbage with the wolf. How do I get everything across to the other side?
[00:27:00]
Now the problem is, this looks a lot like something called the classic river-crossing puzzle. So classic, in fact, that it has a whole Wikipedia page about it. And in the classic puzzle, the wolf would eat the goat, or the goat would eat the cabbage. Now in Steve’s version, he changed it. The goat would eat the cabbage and the wolf would eat the cabbage, but the wolf won’t eat the goat. So what happens? Well, very interestingly, GPT-4 here is entirely overwhelmed by the language model training. It’s seen this puzzle so many times, it knows what word comes next. So it says, oh yeah, I take the goat across the river and leave it on the other side, leaving the wolf with a cabbage, but we’re just told you can’t leave the wolf with a cabbage.
[00:28:04]
So it gets it wrong. Now the thing is, though, you can encourage GPT-4, or any of these language models, to try again. During the instruction tuning and RLHF, they're actually fine-tuned with multi-stage conversations. So you can give it a multi-stage conversation: repeat back to me the constraints I listed. What happened after step one? Is a constraint violated? Oh yeah, yeah, yeah, I made a mistake. Okay, my new attempt: instead of taking the goat across the river and leaving it on the other side, I'll take the goat across the river and leave it on the other side. It's done the same thing. Oh yeah, I did do the same thing. Okay, I'll take the wolf across. Well, now the goat's with the cabbage. That still doesn't work. Oh yeah, that didn't work either. Sorry about that. Instead of taking the goat across to the other side, I'll take the goat across to the other side.
[00:29:01]
Okay, what's going on here, right? This is terrible. Well, one of the problems here is that this particular goat puzzle is so common on the internet that it's extremely confident it knows what the next word is. Also, on the internet, when you see stuff on a web page that's stupid, it's really likely to be followed up with more stuff that is stupid. Once GPT-4 starts being wrong, it tends to be more and more wrong; it's very hard to turn it around and start making it be right. So you actually have to go back, and there's actually an edit button on these chats. And so what you generally want to do, if it's made a mistake, is don't say: here's more information to help you fix it.
[00:30:03]
But instead, go back and click the edit button and change it there. And so this time it's not going to get confused. In this case, actually fixing Steve's example takes quite a lot of effort, but I think I've managed to get it to work eventually. I actually said: oh, sometimes people read things too quickly, they don't notice things, and it can trip them up. Then they apply some pattern and get the wrong answer. You do the same thing, by the way. So I'm going to trick you, so before you're about to get tricked, make sure you don't get tricked. Here's the tricky puzzle. And then also, with my custom instructions, it takes time discussing it, and this time it gets it correct. It takes the cabbage across first. So it took a lot of effort to get to a point where it could actually solve this.
[00:31:03]
Because, yeah, for things where it's been primed to answer a certain way again and again and again, it's very hard for it not to do that.
Advanced Data Analysis (Code Interpreter)
Okay. Now something else super helpful that you can use is what they call advanced data analysis. In advanced data analysis, you can ask it to basically write code for you. And we're going to look at how to implement this from scratch ourselves quite soon, but first of all, let's learn how to use it. So I was trying to build something that splits a document into sections on third-level markdown headings, so that's three hashes at the start of a line. And I was doing it on the whole of Wikipedia, so using regular expressions was really slow. So I said, oh, I want to speed this up. And it said, okay, here's some code, which is great, because then I can say: test it, and include edge cases.
[00:32:05]
And so it then puts in the code, creates the edge cases, tests it, and it says, yep, it’s working. However, I’ve discovered it’s not. I noticed it’s actually removing the carriage return at the end of each sentence. So I said, oh, fix that and update your tests. So it said, okay. So now it’s changed the test, updated the test cases. Let’s run them and, oh, it’s not working. So it says, oh yeah, fix the issue in the test cases. Nope, that didn’t work. And you can see it’s quite clever the way it’s trying to fix it by looking at the results. But as you can see, it’s not. Every one of these is another attempt, another attempt, another attempt, until eventually I gave up waiting.
[00:33:01]
And it's so funny, each time it's like: debugging again, okay, this time I've got to handle it properly. And I gave up at the point where it said, oh, one more attempt. So it didn't solve it, interestingly enough. And, you know, there are some limits to the amount of logic that it can do. This was really a very simple thing I asked it to do for me. And so hopefully you can see that you can't expect even GPT-4 code interpreter, or advanced data analysis as it's now called, to mean you don't have to write code anymore. It's not a substitute for having programmers. But it can often do a lot, as I'll show you in a moment. So for example, OCR: this is something I thought was really cool.
[00:34:00]
You can just paste, sorry, upload: in GPT-4's advanced data analysis you can upload an image here. And I wanted to basically grab some text out of an image. Somebody had posted a screenshot of their screen, something saying, oh, this language model can't do this, and I wanted to try it as well. So rather than retyping it, I just uploaded that image, my screenshot, and said, can you extract the text from this image? And it said, oh yeah, I could do that, I could use OCR. So it literally wrote an OCR script, and there it is. It just took a few seconds. So the difference here is it didn't really require it to think of much logic; it could just use a very, very familiar pattern that it would have seen many times. So this is generally where I find language models excel: where they don't have to think too far outside the box.
[00:35:06]
I mean, it's great on kind of creativity tasks, but for reasoning and logic tasks that are outside the box, I find it not great. But yeah, it's great at writing code for a whole wide variety of different libraries and languages. Having said that, by the way, Google also has a language model called Bard. It's way less good than GPT-4 most of the time, but there is a nice thing: you can literally paste an image straight into the prompt, and I just typed "OCR this", and it didn't even have to go through a code interpreter or whatever. It just said, oh sure, I've done it, and there's the result of the OCR. And then it even commented on what it had just OCR'd, which I thought was cute. And, even more interestingly, it figured out where the OCR'd text came from and gave me a link to it.
[00:36:00]
So I thought that was pretty cool. Okay, so there's an example of it doing well. I'll show you one from this talk that I found really helpful. I wanted to show you guys how much it costs to use the OpenAI API. But unfortunately, when I went to the OpenAI web page, it was, like, all over the place. The pricing information was in all separate tables, and it was a bit of a mess. So I wanted to create a table with all of the information combined, like this. And here's how I did it. I went to the OpenAI pricing page, hit Cmd-A to select all, and then I said in ChatGPT: create a table with the pricing information rows, no summarization, no information not in this page, every row should appear as a separate row in your output; and I hit paste.
[00:37:00]
Now that was not very helpful to it, because hitting paste, it’s got the nav bar, it’s got lots of extra information at the bottom, it’s got all of its footer, etc. But it’s really good at this stuff. It did it first time. So there was the markdown table. So I copied and pasted that into Jupyter, and I got my markdown table. And so now you can see at a glance the cost of GPT-4, 3.5, etc. But then what I really wanted to do was show you that as a picture. So I just said, oh, chart the input row from this table, and just pasted the table back. And it did. So that’s pretty amazing. Now, so let’s talk about this pricing.
OpenAI API Pricing
So, so far we've used ChatGPT, which costs 20 bucks a month, and there's no per-token cost or anything.
[00:38:00]
But if you want to use the API from Python or whatever, you have to pay per token, which is approximately per word; maybe it's about one and a third tokens per word on average. Unfortunately, in the chart it did not include the headers saying which rows are GPT-4 and which are GPT-3.5. So these first two are GPT-4, and these two are GPT-3.5, and you can see GPT-3.5 is way, way cheaper. You can see it here: $0.03 versus $0.0015 per thousand input tokens. So it's so cheap you can really play around with it and not worry. And I want to give you a sense of what that looks like. Okay, so why would you use the OpenAI API rather than ChatGPT? Because you can do it programmatically. So you can, you know, analyze datasets, you can do repetitive stuff.
[00:39:02]
It's kind of like a different way of programming, you know: it's things that you can describe. But let's just look at the most simple example of what that looks like. So if you pip install openai, then you can import ChatCompletion. And then you can say, okay, ChatCompletion.create using gpt-3.5-turbo, and then you can pass in a system message; this is basically the same as custom instructions. So, okay: you're an Aussie LLM that uses Aussie slang and analogies wherever possible. And you can see I'm passing in an array here of messages. The first is the system message, and then the user message, which is: what is money? Okay, so GPT-3.5 returns a big nested dictionary, and the message content is: well, money is like the oil that keeps the machinery of our economy running smoothly.
[00:40:02]
There you go. Just like a koala loves its eucalyptus leaves, we humans can't survive without this stuff. So there's the Aussie LLM's view of what money is. So really, the main ones I pretty much always use are GPT-4 and GPT-3.5. GPT-4 is just so, so much better at anything remotely challenging, but obviously it's much more expensive. So, rule of thumb: maybe try 3.5-turbo first, see how it goes, and if you're happy with the results, then great; if you're not, pony up for the more expensive one. Okay, so I just created a little function here called response that will print out this nested thing. And then the other thing to point out here is that the result of this also has a usage field, which tells you how many tokens it used.
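Here's roughly what that looks like as code, a sketch assuming the older openai 0.x ChatCompletion interface used in the talk (the current openai 1.x client spells this differently):

```python
# pip install openai, with OPENAI_API_KEY set in your environment
from openai import ChatCompletion

aussie_sys = "You are an Aussie LLM that uses Aussie slang and analogies whenever possible."

c = ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": aussie_sys},
              {"role": "user",   "content": "What is money?"}])

def response(compl):
    "Print just the message text from the nested completion object"
    print(compl.choices[0].message.content)

response(c)
print(c.usage)   # prompt_tokens, completion_tokens, total_tokens
```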
[00:41:04]
So it's about 150 tokens. So at $0.002 per thousand tokens, 150 tokens means we just paid 0.03 cents ($0.0003) to get that done. So as you can see, the cost is insignificant. If we were using GPT-4, it would be $0.03 per thousand, so it'd be about half a cent. So unless you're doing many thousands of GPT-4 calls, you're not going to get even up into the dollars, and with GPT-3.5 even less than that. But, you know, keep an eye on it. OpenAI has a usage page, and you can track your usage. Now, and this is really important to understand, what happens when we have a follow-up in the same conversation?
[00:42:07]
How does that work? So we just asked what goat means. So for example, Michael Jordan is often referred to as the goat for his exceptional skills and accomplishments, and Elvis and the Beatles are referred to as goat due to their profound influence and achievement. So I could say, what profound influence and achievements are you referring to? Okay, well I meant Elvis Presley and the Beatles did all these things. Now how does that work? How does this follow-up work? Well, what happens is the entire conversation is passed back, and so we can actually do that here. So here is the same system prompt, here is the same question, right?
[00:43:06]
And then the answer comes back with role assistant. Now I'm going to do something pretty cheeky: I'm going to pretend that it didn't say money is like oil. I'm going to say, oh, you actually said money is like kangaroos. I thought, what is it going to do? So you can literally invent a conversation in which the language model said something different, because this is actually how it's done: in a multi-stage conversation there's no state, right? There's nothing stored on the server. You're passing back the entire conversation again and telling it what it told you. So I'm going to tell it that it told me that money is like kangaroos, and then as the user I'll ask: oh really? In what way? And this is kind of cool, because you can see how it convinces you of something I just invented. Oh, let me break it down for you, cobber.
[00:44:00]
Just like kangaroos hop around and carry their joeys in their pouch, money is a means of carrying value around. So there you go: make-your-own-analogy. Cool, so I'll create a little function here that just puts these things together for us: the system message if there is one, the user message, and it returns the completion (there's a sketch of all this below). And so now we can ask it, what's the meaning of life, passing in the Aussie system prompt. The meaning of life is like trying to catch a wave on a sunny day at Bondi Beach. Okay, there you go. So what do you need to be aware of? Well, as I said, one thing is keep an eye on your usage: if you're calling it hundreds or thousands of times in a loop, keep an eye on not spending too much money. But also, if you're doing it too fast, particularly in the first day or two you've got an account, you're likely to hit the limits for the API, and the limits initially are pretty low.
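Before we get to those rate limits, here's a rough sketch of the two things just described: a follow-up that re-sends the whole conversation (including the invented kangaroo answer), and the little helper function, again assuming the 0.x interface:

```python
# A follow-up is just the whole message history sent back again,
# so you can even put words in the assistant's mouth.
c = ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system",    "content": aussie_sys},
              {"role": "user",      "content": "What is money?"},
              {"role": "assistant", "content": "Well, mate, money is like kangaroos actually."},
              {"role": "user",      "content": "Really? In what way?"}])
response(c)

# And the little helper that puts these pieces together
def askgpt(user, system=None, model="gpt-3.5-turbo", **kwargs):
    "Send a single user message (plus optional system message) and return the completion"
    msgs = []
    if system: msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": user})
    return ChatCompletion.create(model=model, messages=msgs, **kwargs)

response(askgpt("What is the meaning of life?", system=aussie_sys))
```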
[00:45:02]
As you can see, three requests per minute; that's for free users, or paid users in their first 48 hours, and after that it starts going up, and you can always ask for more. I just mention this because you're going to want to have a function that keeps an eye on that. And so what I did is I actually just went to Bing, which has a somewhat crappy version of GPT-4 nowadays, but it can still do basic stuff for free. And I said: please show me Python code to call the OpenAI API and handle rate limits. And it wrote code along the lines sketched below: it's got a try block, checks for rate limit errors, grabs the retry-after, sleeps for that long, and calls itself. And so now we can use that to ask, for example, what's the world's funniest joke?
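Something along these lines does the job; this is a sketch rather than Bing's exact output, and it assumes the 0.x openai error classes:

```python
import time, openai

def call_api(prompt, model="gpt-3.5-turbo"):
    "Call the chat API, sleeping and retrying whenever we hit the rate limit"
    msgs = [{"role": "user", "content": prompt}]
    try:
        return ChatCompletion.create(model=model, messages=msgs)
    except openai.error.RateLimitError as e:
        retry_after = int(e.headers.get("retry-after", 60))
        print(f"Rate limit exceeded, waiting for {retry_after} seconds...")
        time.sleep(retry_after)
        return call_api(prompt, model=model)

response(call_api("What's the world's funniest joke?"))
```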
[00:46:01]
And there we go, there's the world's funniest joke. So that's the basic stuff you need to get started using the OpenAI LLMs. And yeah, I'd definitely suggest spending plenty of time with that, so that you feel like you're really an expert LLM user. So what else can we do?
Building a Code Interpreter in Jupyter
Well, let's create our own code interpreter that runs inside Jupyter. And to do this, we're going to take advantage of a really nifty thing called function calling, which is provided by the OpenAI API. With function calling, our askgpt function, which is this little one here, has room to pass in some keyword arguments that just get passed along to ChatCompletion.create. And one of those keyword arguments you can pass is functions.
[00:47:12]
What on earth is that? Functions tells OpenAI about tools that you have, about functions that you have. So for example, I created a really simple function called sums. And it adds two things. In fact, it adds two ints. And I’m going to pass that function to chatCompletion.create. Now you can’t pass a Python function directly. You actually have to pass what’s called the JSON schema. So you have to pass the schema for the function. So I created this nifty little function that you’re welcome to borrow, which uses Pydantic and also Python’s inspect module to automatically take a Python function and return the schema for it.
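Concretely, the two pieces look roughly like this; it's a sketch of the approach described, using Pydantic's create_model together with the inspect module:

```python
from pydantic import create_model
import inspect
from inspect import Parameter

def sums(a: int, b: int = 1) -> int:
    "Adds a + b"    # the docstring is what the model reads to decide when (and how) to call this
    return a + b

def schema(f):
    "Generate a JSON schema describing function `f` from its signature and docstring"
    kw = {n: (o.annotation, ... if o.default == Parameter.empty else o.default)
          for n, o in inspect.signature(f).parameters.items()}
    s = create_model(f"Input for `{f.__name__}`", **kw).schema()
    return dict(name=f.__name__, description=f.__doc__, parameters=s)

schema(sums)   # -> {'name': 'sums', 'description': 'Adds a + b', 'parameters': {...}}
```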
[00:48:15]
And so this is actually what’s going to get passed to OpenAI. So it’s going to know that there’s a function called sums. It’s going to know what parameters it takes, what the defaults are, and what’s required. So this is like, when I first heard about this, I found this a bit mind-bending, because this is so different to how we normally program computers, where the key thing for programming the computer here actually is the docstring. This is the thing that GPT-4 will look at and say, oh, what does this function do? So it’s critical that this describes exactly what the function does. And so if I then say, what is 6 plus 3?
[00:49:00]
I really wanted to make sure it actually used the function here, so I prompted it pretty firmly to do so, because obviously it knows how to add 6 and 3 itself without calling sums. So it'll only use your functions if it feels it needs to, which is a weird concept. I mean, I guess "feels" is not a great word to use, but you kind of have to anthropomorphize these things a little bit, because they don't behave like normal computer programs. So if I ask GPT what 6 plus 3 is, and tell it that there's a function called sums, then it does not actually return the number 9. Instead it returns something saying: please call a function, call this function, and pass it these arguments. So if I print it out, there are the arguments. So I created a little function called call function, and it goes into the result from OpenAI, grabs the function call, checks that the name is something that it's allowed to call, grabs it from the global symbol table, and calls it, passing in the parameters.
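A sketch of that little helper, plus the question being asked with the functions argument (the exact prompts and the allowed-list are illustrative):

```python
import json

def call_func(completion, allowed=("sums",)):
    "Run the function call the model asked for, if its name is on the allowed list"
    fc = completion.choices[0].message.function_call
    if fc.name not in allowed:
        raise ValueError(f"Not allowed: {fc.name}")
    f = globals()[fc.name]                 # look the function up in the global symbol table
    return f(**json.loads(fc.arguments))   # the arguments come back as a JSON string

c = askgpt("Use the `sums` function to solve this: what is 6+3?",
           system="You must use the provided tools for any calculations.",
           functions=[schema(sums)])
print(c.choices[0].message.function_call)  # the model asks us to call sums, with arguments like a=6, b=3
call_func(c)                               # -> 9
```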
[00:50:10]
And so if I now say, okay, call the function that we got back, we finally get 9. So this is a very simple example. It’s not really doing anything that useful, but what we could do now is we can create a much more powerful function called Python. And the Python function executes code using Python and returns the result. Now, of course, I didn’t want my computer to run arbitrary Python code that GPT-4 told it to without checking, so I just got it to check first. So now I can say, ask GPT, what is 12 factorial?
[00:51:09]
System prompt, you can use Python for any required computations, and say, okay, here’s a function you’ve got available. It’s the Python function. So if I now call this, it will pass me back again a completion object, and here it’s going to say, okay, I want you to call Python passing in this argument. And when I do, it’s going to go import math, result equals blah, and then return result. Do I want to do that? Yes, I do. And there it is. Now, there’s one more step which we can optionally do. I mean, we’ve got the answer we wanted, but often we want the answer in more of a chat format. And so the way to do that is to, again, repeat everything that you’ve passed into so far, but then instead of adding an assistant role response, we have to provide a function role response, and simply put in here the result we got back from the function.
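Putting it together, here's a sketch of a python tool with that confirmation step, and of the function-role follow-up message; it's illustrative rather than the exact code from the talk, and depending on the model you may also need to include the assistant's function_call message in the history:

```python
import ast

def python(code: str):
    "Return the result of executing `code` using Python. Use this for any required computations."
    # the manual check before running anything the model sends us
    if input(f"Proceed with execution?\n{code}\n(y/n) ").lower() != "y":
        return "#FAIL#"
    tree = ast.parse(code)
    ns = {}
    # if the code ends in a bare expression, evaluate it separately so we can return its value
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last = ast.Expression(tree.body.pop(-1).value)
        exec(compile(tree, "<code>", "exec"), ns)
        return eval(compile(last, "<code>", "eval"), ns)
    exec(compile(tree, "<code>", "exec"), ns)
    return ns.get("result")   # fall back to a variable called `result`, if there was one

c = askgpt("What is 12 factorial?",
           system="Use python for any required computations.",
           functions=[schema(python)])
result = call_func(c, allowed=("python",))   # runs something like `import math; math.factorial(12)`

# For a prose answer, send the conversation again with a `function` role message
# holding the result we got back from the tool.
c2 = ChatCompletion.create(
    model="gpt-3.5-turbo",
    functions=[schema(python)],
    messages=[{"role": "system", "content": "Use python for any required computations."},
              {"role": "user", "content": "What is 12 factorial?"},
              {"role": "function", "name": "python", "content": str(result)}])
response(c2)
```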
[00:52:16]
And if we do that, we now get the prose response: 12 factorial is equal to 479,001,600. Now, with functions like python available, you can still ask it about non-Python things, and it just ignores the function if it doesn't need it, right? So you can have a whole bunch of functions available that you've built to do whatever you need, for the stuff the language model isn't familiar with. It'll still solve whatever it can on its own, and use your tools, use your functions, where needed.
[00:53:08]
Okay. So we have built our own code interpreter from scratch. I think that’s pretty amazing. So that is what you can do with, or some of the stuff you can do with OpenAI.
Using Language Models on Your Own Computer
What about stuff that you can do on your own computer? Well, to use a language model on your own computer, you’re going to need to use a GPU. So I guess the first thing to think about is like, do you want this? Does it make sense to do stuff on your own computer? What are the benefits? There are not any open source models that are as good yet as GPT-4.
[00:54:06]
And I would have to say also, like, actually OpenAI's pricing is really pretty good, so it's not immediately obvious that you definitely want to go in-house. But there are lots of reasons you might want to, and we'll look at some examples of them today. One reason you might want to go in-house is that you want to be able to ask questions about your proprietary documents, or about information after September 2021, the knowledge cutoff. Or you might want to create your own model that's particularly good at solving the kinds of problems that you need to solve, using fine tuning. And these are all things where you absolutely can get better-than-GPT-4 performance, at work or at home, without too much money or trouble. So these are the situations in which you might want to go down this path.
[00:55:00]
And so you don’t necessarily have to buy a GPU. On Kaggle they will give you a notebook with two quite old GPUs attached, and very little RAM. But it’s something. Or you can use Colab, and on Colab you can get much better GPUs than Kaggle has, and more RAM, particularly if you pay a monthly subscription fee. So those are some options for free or low cost. You can also of course go to one of the many GPU server providers, and they change all the time as to what’s good or what’s not. RunPod is one example. And you can see, you know, if you want the biggest and best machine, you’re talking $34 an hour.
[00:56:00]
So it gets pretty expensive. But you can certainly get things a lot cheaper, 80 cents an hour or so. Lambda Labs is often pretty good. You know, it's really hard at the moment to actually find providers that have GPUs available; they've got lots listed here, but they often have none or very few available. There's also something pretty interesting called vast.ai, which basically lets you use other people's computers when they're not using them. And as you can see, they tend to be much cheaper than other folks, and they tend to have better availability as well. But of course, for sensitive stuff, you don't want to be running it on some rando's computer.
[00:57:02]
So anyway, there are a few options for renting stuff. But, you know, if you can, I think it's worth buying something, and definitely the one to buy at the moment is a used RTX 3090. You can generally get them from eBay for like 700 bucks or so. A 4090 isn't really better for language models, even though it's a newer GPU, because language models are all about memory speed (how quickly can you get stuff in and out of memory) rather than how fast the processor is, and that hasn't really improved a whole lot. So save the 2,000 bucks or so that a 4090 costs. The other thing, as well as memory speed, is memory size. 24 gigs doesn't quite cut it for a lot of things, so you'd probably want to get two of these GPUs, so you're talking like $1,500 or so. Or you can get a 48 gig RAM GPU called an A6000, but this is going to cost you more like five grand.
[00:58:03]
So again, getting two of the 3090s is going to be a better deal, and the A6000 is not going to be faster than them either. Or, funnily enough, you could just get a Mac with a lot of RAM, particularly if you get an M2 Ultra. Macs, particularly the M2 Ultra, have pretty fast memory. It's still going to be way slower than using an NVIDIA card, but you're going to be able to get, I think, 192 gig or something. So it's not a terrible option, particularly if you're not training models and you just want to use existing trained models. But anyway, most people who do this stuff seriously, almost everybody, have NVIDIA cards.
HuggingFace Transformers
So then what we’re going to be using is a library called Transformers from HuggingFace.
[00:59:04]
And the reason for that is that basically people upload lots of pre-trained models, or fine-tuned models, to the HuggingFace hub. And in fact, there's even a leaderboard where you can see which are the best models. Now, this is a really fraught area. At the moment, this one is meant to be the best model: it has the highest average score. And maybe it is good; I haven't actually used this particular model. Or maybe it's not. I actually have no idea, because the problem is these metrics are not particularly well aligned with real-life usage, for all kinds of reasons. And also, sometimes you get something called leakage, which means that some of the questions from these benchmarks actually leak through into some of the training sets.
[01:00:01]
So you can get a rule of thumb for what to use from here, but you should always try things. And you can also see, you know, these ones here are all 70b; that tells you how big the model is. So this is a 70 billion parameter model. Generally speaking, for the kinds of GPUs we're talking about, you'll be wanting no bigger than 13b, and quite often 7b. So let's see if we can find… a 13b model, for example. So you can find models to try out from things like this leaderboard.
Model Leaderboards
And there’s also a really great leaderboard called FastEval, which I like a lot, because it focuses on some more sophisticated evaluation methods, such as this chain of thought evaluation method. So I kind of trust these a little bit more.
[01:01:01]
These also include GSM8K, which is a difficult math benchmark, BigBench-Hard, and so forth. So yeah, StableBeluga2, WizardMath 13B, Dolphin Llama 13B, etc., these would all be good options. So you need to pick a model, and at the moment nearly all the good models are based on Meta's Llama 2.
Llama2: A Popular Open Source Model
So when I say based on, what does that mean? Well, what that means is this model here, Llama2 7b, so it’s a Llama model. That’s just the name Meta called it. This is their version 2 of Llama. This is their 7 billion size one. It’s the smallest one that they make. And specifically these weights have been created for HuggingFace, so you can load it with the HuggingFace transformers. And this model has only got as far as here.
[01:02:00]
It's done the language model pre-training; it's done none of the instruction tuning, and none of the RLHF. So we would need to fine-tune it to really get it to do much that's useful. So we can just say, okay, automatically create the appropriate model for causal language modeling: "causal LM" basically refers to that ULMFiT stage 1 process (or stage 2, in fact). So get the pre-trained model from this name, meta-llama/Llama-2-7b, blah blah. Okay, now generally speaking we use 16-bit floating point numbers nowadays. But if you think about it, 16 bits is 2 bytes, so 7 billion parameters times 2 is going to be 14 gigabytes just to load in the weights.
[01:03:00]
So you've got to have a decent GPU to be able to do that. Perhaps surprisingly, you can actually just cast it to 8-bit, and it still works pretty well, thanks to something called quantization.
Using 8-bit and GPTQ for Faster Inference
So let's try that. So remember, this is just a language model; it can only complete sentences. We can't ask it a question and expect a great answer. So let's just give it the start of a sentence: Jeremy Howard is a. And we need the right tokenizer, so this will automatically create the right kind of tokenizer for this model. We can grab the tokens as PyTorch tensors. Here they are. And just to confirm, if we decode them back again, we get back the original, plus a special token to say this is the start of a document. And so we can now call generate. Generate works auto-regressively: it calls the model again and again, passing its previous result back in as the next input.
[01:04:08]
And I'm just going to do that 15 times. You can write this for loop yourself; this isn't doing anything fancy. In fact, I would recommend writing it yourself to make sure that you know how it all works. We have to put those tokens on the GPU, and at the end I recommend putting the result back onto the CPU. And then we have to decode them using the tokenizer. And so the first 15 tokens are: Jeremy Howard is a 28 year old Australian AI researcher and entrepreneur. Okay, well, 28 years old is not exactly correct, but we'll call it close enough. I like that. Thank you very much, Llama 7b. So okay, we've got a language model completing sentences. It took one and a third seconds.
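Here's a sketch of that whole pipeline; it assumes the transformers, accelerate and bitsandbytes libraries are installed, a CUDA GPU is available, and that you've been granted access to the gated Llama 2 weights on the HuggingFace hub:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

mn = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(mn, device_map="auto", load_in_8bit=True)
tokr = AutoTokenizer.from_pretrained(mn)

prompt = "Jeremy Howard is a "
toks = tokr(prompt, return_tensors="pt")    # token ids as PyTorch tensors

# generate 15 more tokens auto-regressively, then move the result back to the CPU
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to("cpu")
print(tokr.batch_decode(res))
```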
[01:05:01]
And that's a bit slower than it could be, because we used 8-bit. If we use 16-bit, there's a special thing called bfloat16, which is a really great 16-bit floating point format that's usable on any somewhat recent NVIDIA GPU. Now, if we use it, it's going to take twice as much RAM, as we discussed, but look at the time: it's come down to 390 milliseconds. Now, there is a better option still than even that. There's a different kind of quantization called GPTQ, where a model is carefully optimized to work with 4-bit or 8-bit or other lower precision data automatically. And this particular person, known as TheBloke, is fantastic at taking popular models, running that optimization process, and then uploading the results back to HuggingFace.
[01:06:06]
So we can use this GPTQ version, and internally this is actually going to use, I'm not sure exactly how many bits this particular one is, I think it's probably going to be four bits, but it's going to be much more optimized. And so look at this, 270 milliseconds: it's actually faster than 16-bit, even though internally it's casting it up to 16-bit in each layer to do the computation. And that's because there's a lot less memory moving around. And what we could even do now is go up to the 13B model. Easy. And in fact, it's still faster than the 7B was, now that we're using the GPTQ version. So this is a really helpful tip. So let's put all those things together: the tokenizer, the generate, the batch decode; we'll call this gen, for generate. And so we can now use the 13B GPTQ model.
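A sketch of that gen helper together with a quantized 13B model; the repo name is one of TheBloke's GPTQ conversions given as an example, and loading it this way also needs the optimum and auto-gptq packages:

```python
import torch

mn = "TheBloke/Llama-2-13B-GPTQ"   # example repo name, check the hub for current GPTQ uploads
model = AutoModelForCausalLM.from_pretrained(mn, device_map="auto", torch_dtype=torch.float16)
tokr = AutoTokenizer.from_pretrained(mn)

def gen(prompt, max_new_tokens=128):
    "Tokenize the prompt, generate up to max_new_tokens, and decode the result"
    toks = tokr(prompt, return_tensors="pt")
    res = model.generate(**toks.to("cuda"), max_new_tokens=max_new_tokens).to("cpu")
    return tokr.batch_decode(res)

gen("Jeremy Howard is a ", 50)
```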
[01:07:03]
And let's try this: Jeremy Howard is a. So it's generated 50 tokens, really fast: 16-year veteran of Silicon Valley, co-founder of Kaggle, a marketplace for predictive modeling. His company, Kaggle.com, has become a data science competition... I don't know what it was going to say next, but anyway, it's on the right track. I was actually there for 10 years, not 16, but that's all right. Okay, so this is looking good.
Instruction Tuned Models
But probably a lot of the time we're going to be interested in asking questions or giving instructions. So Stability AI has this nice series called Stable Beluga, including a small 7B one and other bigger ones. And these are all based on Llama 2, but they have been instruction tuned; they might even have been RLHF'd, I can't remember now. So we can create a Stable Beluga model. And now, something really important that I keep forgetting, that everybody keeps forgetting: during the instruction tuning process, the instructions that are passed in don't just appear as plain text like this.
[01:08:26]
They are actually always in a particular format. And the format, believe it or not, changes quite a bit from fine-tune to fine-tune. So you have to go to the webpage for the model and scroll down to find out what the prompt format is. So here's the prompt format. I generally just copy it and then paste it into Python, which is what I did here: I created a function called make_prompt that uses the exact same format that it said to use.
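For Stable Beluga that boils down to something like this sketch; the layout follows the ### System / ### User / ### Assistant format from the model card, and the system message wording here is just a placeholder (the card suggests its own):

```python
mn = "stabilityai/StableBeluga-7B"
model = AutoModelForCausalLM.from_pretrained(mn, device_map="auto", torch_dtype=torch.bfloat16)
tokr = AutoTokenizer.from_pretrained(mn)

sb_sys = "### System:\nYou are a helpful AI assistant that follows instructions carefully.\n\n"

def make_prompt(user, syst=sb_sys):
    "Wrap the user's question in the prompt format from the model card"
    return f"{syst}### User:\n{user}\n\n### Assistant:\n"

gen(make_prompt("Who is Jeremy Howard?"), 150)
```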
[01:09:08]
And so now, if I want to ask who is Jeremy Howard, I can call gen again (that was that function I created up here) and make the correct prompt from that question. And then it returns back. Okay, so you can see here all this prefix: this is the system instruction, this is my question, and then the assistant says: Jeremy Howard is an Australian entrepreneur, computer scientist, and co-founder of the machine learning and deep learning company fast.ai. Okay, so this one's actually all correct. So it's getting better by using an actual instruction-tuned model.
Scaling Up with Larger Models
And so we could then start to scale up. So we could use the 13b. And in fact, we looked briefly at this OpenOrca dataset earlier. So Llama2 has been fine-tuned on OpenOrca, and then also fine-tuned on another really great dataset called Platypus.
[01:10:04]
And so the whole thing together is OpenOrca Platypus, and then this is going to be the bigger 13B; GPTQ means it's going to be quantized. So that's got a different prompt format. Again, we can scroll down and see what the prompt format is. There it is. Okay, and so we can create a function called makeOpenOrcaPrompt that uses that prompt format. And so now we can say, okay, who is Jeremy Howard? And now I've become British, which is kind of true, I was born in England, but I moved to Australia. A professional poker player, no, definitely not that. Co-founding several companies, including fast.ai, also Kaggle. Okay, so not bad. It was acquired by Google, was it 2017? Probably something around there. Okay, so you can see we've got our own models giving us some pretty good information.
[01:11:09]
Retrieval Augmented Generation (RAG)
How do we make it even better? You know, because it's still hallucinating. Now, Llama 2, I think, has been trained with more up-to-date information than GPT-4; it doesn't have the September 2021 cutoff. But it's still got a knowledge cutoff of its own. We would like to be able to use the most up-to-date information, and we want to use the right information, to answer these questions as well as possible. So to do this, we can use something called retrieval augmented generation. What happens with retrieval augmented generation is: when we get asked a question, like who is Jeremy Howard, we say, okay, let's try to search for documents that may help us answer that question.
[01:12:07]
So obviously we would expect, for example, Wikipedia to be useful. And then what we do is we say, okay, with that information, let's now tell the language model about what we found, and then have it answer the question. So let me show you. Let's grab a Wikipedia Python package, and we'll use it to fetch the Jeremy Howard page. And so here's the start of the Jeremy Howard Wikipedia page. It has 613 words. Now, generally speaking, these open source models will have a context length of about 2,000 or 4,000 tokens; the context length is how many tokens it can handle. So that's fine.
[01:13:01]
It'll be able to handle this webpage. And what we're going to do is ask it the question, but before the question, we're going to say: answer the question with the help of the context. We're going to provide this to the language model, with a context section that contains the whole webpage. So suddenly now our prompt is going to be a lot bigger: it contains the entire webpage, the whole Wikipedia page, followed by our question. And so now it says: Jeremy Howard is an Australian data scientist, entrepreneur, and educator, known for his work in deep learning, co-founder of fast.ai, teaches courses, develops software, conducts research, used to be... yeah, okay, it's perfect. It's actually done a really good job. Like, if somebody asked me to send them a 100 word bio, that would actually probably be better than I would have written myself.
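Here's a sketch of that, using the wikipedia Python package and the same make_prompt and gen helpers from above; the page title is an assumption, so adjust it to whatever page you actually want:

```python
from wikipedia import page

jh_page = page("Jeremy Howard (entrepreneur)").content
print(len(jh_page.split()))       # roughly 600 words, comfortably inside a 2k-4k token context

ques_ctx = f"""Answer the question with the help of the provided context.

## Context

{jh_page}

## Question

Who is Jeremy Howard?"""

res = gen(make_prompt(ques_ctx), 300)
print(res[0])
```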
[01:14:04]
And you’ll see, even though I asked for 300 tokens, it actually sent back the end-of-stream token before then, and so it knows to stop at this point. Well, that’s all very well, but how did we know to pass in the Jeremy Howard Wikipedia page?
Sentence Transformers for Document Retrieval
Well, the way we know which Wikipedia page to pass in is that we can use another model to tell us which webpage or which document is the most useful for answering a question. And the way we do that is we can use something called a sentence transformer: a special kind of model that’s specifically designed to take a document and turn it into a bunch of activations, where two documents that are similar will have similar activations.
[01:15:01]
So let me show you what I mean. What I’m going to do is grab just the first paragraph of my Wikipedia page, and the first paragraph of Tony Blair’s Wikipedia page. Okay, so we’re pretty different people, right? This is just a really simple, small example. And I’m going to then call this model, I’m going to say encode, and I’m going to encode my Wikipedia first paragraph, Tony Blair’s first paragraph, and the question, which was: who is Jeremy Howard? And it’s going to pass back a 384-long vector of embeddings for the question, for me, and for Tony Blair. And what I can now do is calculate the similarity between the question and the Jeremy Howard Wikipedia page, and also between the question and the Tony Blair Wikipedia page.
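A sketch of that encode-and-compare step using the sentence-transformers library; the embedding model name is an assumption (any small model that outputs 384-dimensional vectors follows the same pattern), and the page titles are illustrative.

```python
from sentence_transformers import SentenceTransformer, util
import wikipedia

# The embedding model name is an assumption; it returns 384-dimensional vectors
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# First paragraphs (summaries) of the two pages; titles are illustrative
jh_text = wikipedia.page("Jeremy Howard (entrepreneur)", auto_suggest=False).summary
tb_text = wikipedia.page("Tony Blair", auto_suggest=False).summary
question = "Who is Jeremy Howard?"

q_emb = emb_model.encode(question)
doc_embs = emb_model.encode([jh_text, tb_text])

# Cosine similarity between the question and each document; higher means more relevant
sims = util.cos_sim(q_emb, doc_embs)[0]
print(sims)

# With a few hundred documents you could do exactly the same thing and just pick the closest
best_doc = [jh_text, tb_text][int(sims.argmax())]
```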
[01:16:02]
And as you can see, it’s higher for me. And so that tells you that if you’re trying to figure out which document to use to help you answer this question, you’re better off using the Jeremy Howard Wikipedia page than the Tony Blair Wikipedia page. So if you had a few hundred documents you were thinking of using to give back to the model as context to help it answer a question, you could literally just pass them all through to encode, go through each one, one at a time, and see which is closest. When you’ve got thousands or millions of documents, you can use something called a vector database, where basically, as a one-off thing, you go through and encode all of your documents.
Vector Databases for RAG
And so in fact, there’s lots of pre-built systems for this. Here’s an example of one called H2O-GPT.
[01:17:02]
And this is just something that I’ve got running here on my computer. It’s just an open source thing written in Python, sitting here running on port 7860. And so I’ve just gone to localhost:7860. And what I did was I just clicked upload and uploaded a bunch of papers. In fact, I might be able to see it better. Yeah, here we go. A bunch of papers. And so, you know, we could look at… can we search? Yeah, I can. So for example, we can look at the ULMFit paper that Sebastian Ruder and I did. And you can see it’s taken the PDF and turned it into, slightly crappily, a text format. And then it’s created an embedding for each section.
[01:18:02]
So I could then ask it, you know, what is ULMFit? And I’ll hit enter. And you can see here it’s now actually saying “based on the information provided in the context”. So it’s showing us it’s been given some context. What context did it get? Here are the things that it found, right? So it’s being sent this context, so these are kind of citations. The goal of ULMFit is to improve the performance by leveraging the knowledge and adapting it to the specific task at hand. Then I can ask a follow-up: what techniques does ULMFit use, be more specific… let’s see how it goes.
[01:19:00]
Okay, there we go. So here’s the three steps: pre-train, fine-tune, fine-tune. Cool. So you can see it’s not bad, right? It’s not amazing. Like, you know, the context in this particular case is pretty small. And in particular, if you think about how that embedding thing worked, you can’t really use the normal kind of follow-up. So for example, it says fine-tuning a classifier, so I could say: what classifier is used? Now the problem is that there’s no context here being sent to the embedding model, so it’s actually going to have no idea I’m talking about ULMFit. So generally speaking, it’s going to do a terrible job. Yeah, see, it says it uses a RoBERTa model, but it doesn’t. And if I look at the sources, it’s no longer actually referring to Howard and Ruder. So anyway, you can see the basic idea. This is called retrieval augmented generation, RAG.
[01:20:04]
And it’s a nifty approach, but you have to do it with some care. And so there are lots of these private GPT things out there. Actually, the H2O-GPT webpage does a fantastic job of listing lots of them and comparing them. So as you can see, if you want to run a private GPT, there’s no shortage of options, and you can have your retrieval augmented generation. I’ve only tried this one, H2O-GPT. I don’t love it, it’s all right.
Fine Tuning Language Models
Good. So finally, I want to talk about what’s perhaps the most interesting option we have, which is to do our own fine tuning. And fine tuning is cool because rather than just retrieving documents, which might have useful context, we can actually change our model to behave based on the documents that we have available.
[01:21:10]
I’m going to show you a really interesting example of fine tuning here. What we’re going to do is fine tune using this know_sql dataset. It’s got examples of a schema for a table in a database, a question, and then the answer, which is the correct SQL to solve that question using that database schema. And so I’m hoping we could use this to create a handy tool for business users, where they type some English question and the SQL is generated for them automatically.
[01:22:01]
Fine Tuning for SQL Generation
I don’t know if it’d actually work in practice or not, but this is just a little fun idea I thought we’d try out. I know there’s lots of startups and stuff out there trying to do this more seriously, but this is quite cool because I actually got it working today in just a couple of hours. So what we do is we use the HuggingFace datasets library. Just like the HuggingFace hub has lots of models stored on it, HuggingFace datasets has lots of datasets stored on it. And so instead of using transformers, which is what we use to grab models, we use datasets, and we just pass in the name of the person and the name of their repo and it grabs the dataset. And so we can take a look at it, and it just has a training set with features. And so then I can have a look at the training set.
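A minimal sketch of that step; the repo id below is my best guess at the dataset being referred to, so treat it as an assumption.

```python
from datasets import load_dataset

# The repo id is an assumption about which know_sql dataset is meant here
ds = load_dataset("knowrohit07/know_sql")
print(ds)              # just a train split with its features
print(ds["train"][0])  # a table schema (context), a question, and the target SQL (answer)
```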
[01:23:02]
So here’s an example, which looks a bit like what we’ve just seen. So what we do now is we want to fine tune a model. Now, we can do that in a notebook from scratch. It takes, I don’t know, a hundred or so lines of code, it’s not too much. But given the time constraints here, I thought, why don’t we just use something that’s ready to go? So for example, there’s something called Axolotl, which is quite nice in my opinion. Here it is here, another very nice open source piece of software. And again, you can just pip install it, and it’s got things like GPTQ and 16-bit and so forth ready to go. And so what I did was, it basically has a whole bunch of examples of things that it already knows how to do. It’s got Llama2 examples, so I copied the Llama2 example and created a SQL example.
[01:24:05]
So I basically just told it this is the path to the dataset that I want, and this is the type, and everything else I pretty much left the same. And then I just ran this command, which is from the readme: accelerate launch axolotl, passing in my YAML. And that took about an hour on my GPU. And at the end of the hour, it had created a qlora-out directory. The Q stands for quantized; that’s because I was creating a smaller quantized model. LoRA I’m not going to talk about today, but LoRA is a very cool technique that, basically, is another thing that makes your models smaller, and also means you can use bigger models on smaller GPUs for training. So I trained it, and then I thought, okay, let’s create our own example.
[01:25:04]
So we’re going to have this context and this question: get the count of competition hosts by theme. And I’m not going to pass it an answer, so I’ll just ignore that. So again, I found out what prompt they were using and created a SQL prompt function. And so here’s what I’m going to do: use the following contextual information to answer the question. Context: create table, so there’s the context. Question: get the count of competition hosts by theme. And then I tokenized that and called generate. And the answer was: SELECT COUNT(Hosts), Theme FROM farm_competition GROUP BY Theme.
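Here’s a rough sketch of what that inference step could look like, assuming the fine-tuned adapter landed in the qlora-out directory mentioned above and that the tokenizer was saved alongside it; the schema, question, and prompt template are illustrative reconstructions from the description, not the exact notebook code.

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the fine-tuned adapter that Axolotl wrote out; directory name is an assumption
model = AutoPeftModelForCausalLM.from_pretrained("qlora-out", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("qlora-out")  # assumes the tokenizer was saved alongside

def sql_prompt(context, question):
    # Illustrative reconstruction of the training prompt format
    return ("Use the following contextual information to answer the question.\n\n"
            f"Context: {context}\n\n"
            f"Question: {question}\n\n"
            "Answer: ")

prompt = sql_prompt(
    "CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)",  # illustrative schema
    "Get the count of competition hosts by theme.",
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Hoping for something like: SELECT COUNT(Hosts), Theme FROM farm_competition GROUP BY Theme
```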
[01:26:00]
That is correct. So I think that’s pretty remarkable. We have just built this, and you know, it took me like an hour to figure out how to do it, and then an hour to actually do the training. And at the end of that, we’ve actually got something which is converting prose into SQL based on a schema. So I think that’s a really exciting idea.
Using Macs for Language Modeling
The only other thing I do want to briefly mention is doing stuff on Macs. If you’ve got a Mac, there’s a couple of really good options. The options are MLC and llama.cpp currently. MLC in particular, I think, is kind of underappreciated. It’s a really nice project where you can run language models on literally iPhones, Android devices, web browsers, everything.
[01:27:08]
It’s really cool.
MLC: Running Language Models on Macs and Mobile Devices
And so I’m now actually on my Mac here, and I’ve got a tiny little Python program called chat.py. It imports ChatModule and loads a discretized, i.e. quantized, 7b model, and it’s going to ask the question: what is the meaning of life? So let’s try it: python chat.py. Again, I just installed this earlier today. I haven’t done that much stuff on Macs before, but I was pretty impressed to see that it is doing a good job here. What is the meaning of life is complex and philosophical.
[01:28:03]
Some people might find meaning in their relationships with others, their impact in the world, et cetera, et cetera. Okay. And it’s doing 9.6 tokens per second. So there you go. So there is running a model on a Mac.
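Here’s roughly what that little chat.py script could look like, based on the mlc_chat Python API as it existed at the time; the model name is illustrative, and the package has since evolved, so check the MLC docs for the current API.

```python
# chat.py: a minimal sketch based on the mlc_chat Python API of the time
from mlc_chat import ChatModule

# The model name is illustrative; it refers to a quantized 7B chat model compiled for MLC
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
print(cm.generate(prompt="What is the meaning of life?"))
```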
Llama.cpp: Another Option for Macs and CUDA
And then another option that you’ve probably heard about is llama.cpp. Llama.cpp runs on lots of different things as well, including Macs and also on CUDA. It uses a different format called gguf. And you can use it from Python, even though it’s a C++ thing, because it’s got a Python wrapper. So you can just download a gguf file from HuggingFace. You can just go through, and there’s lots of different ones; they’re all documented as to what’s what. You can pick how big a file you want and download it. And then you just say, okay, Llama, model path equals, and pass in that gguf file.
[01:29:03]
It spits out lots and lots of gunk, and then you can say, okay, so if I called that llm, you can then say llm, question: name the planets of the solar system, 32 tokens. And there we are. One, Pluto, no longer considered a planet. Two, Mercury. Three, Venus. Four, Earth. Five, Mars. Six… oh, and I’ve run out of tokens. So again, you know, it’s just to show you, there are all these different options.
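A minimal sketch using the llama-cpp-python bindings; the gguf filename is illustrative, just use whichever quantization you downloaded from HuggingFace.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The gguf filename is illustrative; point it at the file you downloaded
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")
output = llm("Q: Name the planets of the solar system. A:", max_tokens=32)
print(output["choices"][0]["text"])
```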
Choosing the Right Tools
You know, I would say, if you’ve got an NVIDIA graphics card and you’re a reasonably capable Python programmer, you’d probably be wanting to use PyTorch and the HuggingFace ecosystem. But, you know, these things might change over time as well. And certainly a lot of stuff is coming into llama.cpp pretty quickly now, and it’s developing very fast. As you can see, there’s a lot of stuff that you can do right now with language models, particularly if you feel pretty comfortable as a Python programmer.
[01:30:11]
Conclusion and Call to Action
I think it’s a really exciting time to get involved. In some ways it’s a frustrating time to get involved, because it’s very early, and a lot of stuff has weird little edge cases, and it’s tricky to install and stuff like that. There are a lot of great Discord channels out there; at fast.ai we have our own Discord server, so feel free to just Google for fast.ai Discord and drop in. We’ve got a channel called Generative; feel free to ask any questions or tell us about what you’re finding. Yeah, it’s definitely something where you want to be getting help from other people on this journey, because it is very early days, and people are still figuring things out as we go. But I think it’s an exciting time to be doing this stuff, and I’m really enjoying it.
[01:31:03]
I hope that this has given some of you a useful starting point on your own journey. So I hope you found this useful. Thanks for listening. Bye.