My LLM
For a number of reasons I am teaching myself how to estimate a Large Language Model. I am learning about embedding words and expressions in vectors, introducing semantics, clustering, classification and building chatbots. The vision I have is to introduce an automated assistant to our business to better integrate our database into our daily processes. Along the way, I am curious to see whether I can improve upon the methods for estimating LLMs. So far I am absorbing tutorials but have yet to get my hands dirty with data.
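To make the embedding-and-clustering step concrete, here is a minimal sketch of turning a few expressions into vectors and grouping them. It assumes the sentence-transformers and scikit-learn packages, and the model name and example sentences are purely illustrative; nothing here is specific to any particular tutorial.

```python
# Minimal sketch: embed a few expressions and cluster them.
# Assumes the sentence-transformers and scikit-learn packages;
# the model name is illustrative, not a recommendation.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

expressions = [
    "Out, damned spot!",
    "This is infuriating.",
    "What a glorious morning.",
    "I could not be happier.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # maps text to fixed-length vectors
vectors = model.encode(expressions)               # shape: (4, embedding_dim)

labels = KMeans(n_clusters=2, random_state=0).fit_predict(vectors)
for text, label in zip(expressions, labels):
    print(label, text)
```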
I do have some observations, however, on the approach to estimation. Much of it is from the "… add data to a bucket and stir…" genre of model building, where the desire to market a unique feature of a branded LLM overtakes the statistical reality of that feature's presence in the data. Take the introduction of semantics, for instance. The idea is to attach an emotional label such as 'anger' or 'happiness' to a word or sentence, which is done by appending the emotion to the main object of an expression and estimating a logistic regression for the probability that the expression arises elsewhere with that emotion. "Out damn spot" becomes "out damn spot_anger", which then gets mixed into a training set that ascribes a probability that similar phrases are said in anger.
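To pin down what I mean, here is a toy sketch of that approach as I read it: expressions get an emotion tag, and a logistic regression is fit to the probability that a phrase carries the emotion. The phrases, labels and bag-of-words features are all illustrative assumptions of mine, not anything taken from an actual LLM pipeline.

```python
# Toy sketch of the emotion-tagging idea described above: phrases
# labelled anger / not-anger, with a logistic regression fit to the
# probability that a phrase is said in anger. All data and the
# bag-of-words feature choice are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

phrases = [
    "out damn spot",            # would become "out damn spot_anger" in the scheme
    "I am furious with you",
    "what a lovely day",
    "thank you so much",
]
is_anger = [1, 1, 0, 0]         # 1 = anger-tagged, 0 = not

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(phrases)
clf = LogisticRegression().fit(X, is_anger)

# Estimated probability that a new phrase is said in anger.
new_X = vectorizer.transform(["damn this spot"])
print(clf.predict_proba(new_X)[:, 1])
```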
From what I can ascertain, there is no attempt to judge whether the anger-appended expression has statistical significance, and therefore no indication that the semantic label adds anything to the model's ability to answer questions. It does, however, play well to an audience wanting to believe in an emotional computer. My feeling is that fabricating and estimating model attributes that are not in the data will lead to error and inefficiency – perhaps even hallucinations? – much like an overspecified model in econometrics. The difference is that the embedding vector contains 4096 individual attributes and there are billions of tokens holding the fitted output… how could one go about pruning this model into something that is bare-bones efficient? I don't know the answer to this question, but solving the conundrum would contribute to the AI field, save a lot of computing resources, and impose some discipline on model building.
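For what it is worth, one conventional answer in the pruning literature is magnitude pruning: keep only the largest weights, zero out the rest, and usually fine-tune afterwards. The numpy sketch below is a toy illustration of that idea, with made-up sizes and thresholds; it is not a claim about how any particular LLM is, or should be, pruned.

```python
# Toy sketch of magnitude pruning: zero out the smallest-magnitude
# weights and keep only the largest fraction. Sizes and the keep
# fraction are made up; real pruning is done layer by layer and
# usually followed by fine-tuning.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 1024))          # stand-in for one weight matrix

keep_fraction = 0.10                       # keep the largest 10% of weights
threshold = np.quantile(np.abs(W), 1.0 - keep_fraction)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

print("nonzero before:", np.count_nonzero(W))
print("nonzero after: ", np.count_nonzero(W_pruned))
```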