Do You Actually Know Something About LLM Settings?


Disclosure: This post may contain affiliate links, meaning Chikara Houses get a commission if you decide to make a purchase through our links, at no cost to you. Please read our disclosure for more info. 
Content Disclaimer: The following content has been generated by a combination of human input and an AI language model. Content includes images, text, and video (if any). Please note that while we strive to provide accurate and up-to-date information, content generated by the AI may not always reflect the most current news, events, or developments.

 


 

With the democratization of AI models, we might want to know what settings like temperature, max tokens, and top_p actually do. These settings can vary depending on the specific implementation or application.

 

 

Let's see what these mean and their typical values (a quick code sketch follows the list):

  • Temperature: Controls randomness in the output. A higher value (e.g., 0.7 to 1.0) produces more varied outputs, while a lower value (e.g., 0.2 to 0.5) makes them more focused and deterministic. Chat assistants typically default to a balanced temperature of around 0.7.
  • Max Tokens: Determines the maximum number of tokens (words or subword units) the model may generate. The length of the response depends on this value; many models also cap the total context, input and output combined, at around 4096 tokens or more.
  • Top_p (Nucleus Sampling): Defines a probability threshold to limit the output to a subset of possible words. A value of 0.9 or lower is common, which keeps only the most likely token choices until their probabilities sum to 90%.
  • Frequency Penalty: Adjusts the likelihood of repeating phrases or words. Values typically range from 0 to 2, with 0 meaning no penalty and higher values reducing repetition.
  • Presence Penalty: Encourages the model to introduce new topics rather than repeat what it has already said. Like the frequency penalty, values usually range from 0 to 2.
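
To make this concrete, here's a minimal sketch of passing these settings through the OpenAI Python SDK. The model name and parameter values are illustrative, not recommendations:

```python
# Minimal sketch: passing sampling settings to a chat completion call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",     # example model name
    messages=[{"role": "user", "content": "Explain top_p in one sentence."}],
    temperature=0.7,         # randomness: lower = more focused and deterministic
    max_tokens=256,          # cap on the number of generated tokens
    top_p=0.9,               # nucleus sampling: keep tokens covering 90% probability
    frequency_penalty=0.5,   # discourage repeating the same words
    presence_penalty=0.3,    # nudge the model toward new topics
)
print(response.choices[0].message.content)
```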

       

For a more customized or production-grade setup, these parameters can be adjusted to the specific needs for creativity or accuracy. But if we're working on an AI app MVP, we can tune these settings to match the desired output style, or simply leave them at their default values!

       

       

Example: OpenAI's ChatGPT

For these models, the parameters are likely set to balanced defaults to provide coherent, helpful, and consistent responses. Here's what we can infer based on typical configurations:

  • Temperature: Likely around 0.7 to balance creativity and coherence.
  • Max Tokens: Probably set to 4096 tokens (shared between input and output).
  • Top_p: Likely around 0.9, which limits responses to more likely tokens while allowing some variability.
  • Frequency Penalty: Likely set to 0 or a low value to avoid repetitive responses.
  • Presence Penalty: Likely set low, around 0 to 0.6, allowing a balance between new ideas and focused, consistent responses.

These settings are typical for general-purpose language models aiming to produce natural, useful interactions across a range of topics.
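
Note that the defaults above are inferred for the ChatGPT product. If you call the API directly and omit these parameters, the provider's documented defaults apply instead; for OpenAI's chat completions endpoint, temperature and top_p default to 1 and both penalties default to 0:

```python
# If sampling parameters are omitted, the API falls back to its documented
# defaults (temperature=1, top_p=1, frequency_penalty=0, presence_penalty=0).
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
```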

         

         

Then why do RAG systems use temperature 0?

When using a Retrieval-Augmented Generation (RAG) model, the temperature is often set to 0 to ensure deterministic and factual outputs:

  • Deterministic Behavior: With a temperature of 0, the model always selects the most probable next token, producing consistent responses for the same input. This is critical in retrieval-based systems where factual accuracy and consistency are paramount.
  • Factual Accuracy: RAG models often pull information from external sources (retrieved documents, knowledge bases, etc.). Setting the temperature to 0 makes the model more likely to accurately reflect the retrieved information, reducing the chance of hallucinations or unnecessary variability.
  • Focus on Precision: In tasks like summarization or question answering, where the goal is to convey precise, factual information, a higher temperature could introduce creative or unexpected answers. With a temperature of 0, RAG models focus solely on accuracy.

In contrast, a higher temperature would make the output more creative or varied, which isn’t ideal when you're relying on external, factual data to provide answers.
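
As an illustration, here's a sketch of a RAG answering step pinned to temperature 0. The retrieval itself is assumed to happen elsewhere; `retrieved_chunks` is a hypothetical input, not a real library call:

```python
# Sketch of a RAG answer step: the context comes from a separate (assumed)
# retrieval stage; temperature 0 keeps the answer deterministic.
from openai import OpenAI

client = OpenAI()

def answer_from_docs(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context. "
                        "If the answer is not there, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,  # always pick the most probable token
    )
    return response.choices[0].message.content
```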

             

             

When to use 0 and when to use 0.7? Temperature is the easiest LLM setting to play with.

When we're using multiple LLM calls and actions within an agent that combines retrieval from user-embedded webpages with additional online searches, temperature plays a key role in balancing factual accuracy against creativity. What should we do?

             

When to Use Temperature 0:

  • Retrieval-Based Responses:
    • When the agent is answering factual questions based on information retrieved from user webpages or vector stores.
    • Deterministic behavior is essential for accuracy here. You don’t want the model to introduce variability when it needs to stick strictly to what was retrieved.
    • Example: After searching a vector database for relevant information, use temperature 0 so the response reflects exactly what was found, without creative variation.

  • Summarizing Factual Data:
    • When summarizing documents, articles, or user-provided information where factual accuracy is critical.
    • This ensures the summary stays close to the source material.

  • Report Generation Based on External Data:
    • When creating reports or structured outputs from both retrieved vectors and online search results, temperature 0 is ideal to ensure consistency and avoid any creativity that could deviate from the factual findings.

             

When to Use Temperature 0.7:

  1. Generating New Ideas or Creative Text:
    • When the agent is tasked with generating more open-ended content, like suggestions, brainstorming, or reports with a narrative structure. This setting encourages the model to introduce some variability and novelty.
    • Example: If, after analyzing user-provided data, the agent needs to provide recommendations, insights, or future strategies, a temperature of 0.7 adds creative flair to the output.

  2. Less Fact-Critical Tasks:
    • For tasks where the precision of the output is less critical, like casual explanations or general discussions, or where you want the model to be more engaging and dynamic.

             

             

For Advanced Architectures

For complex workflows with many LLM calls spanning both retrieval and additional online searches, you can dynamically adjust the temperature based on the context:

  • Set temperature to 0 for retrieval and factual reporting stages.
  • Switch to a higher temperature (0.7) when generating text that benefits from creative and varied responses, such as recommendations or summaries that combine data and insights in a more engaging way.

By adjusting the temperature to the task, you can ensure both factual accuracy and creativity in your agent’s responses.
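
As a rough sketch of what that routing could look like in code (the task labels and mapping below are assumptions for illustration, not a standard API):

```python
# Sketch: pick a temperature per agent step based on the kind of task.
TASK_TEMPERATURE = {
    "retrieval_answer": 0.0,  # factual QA over retrieved chunks
    "summarize_source": 0.0,  # stay close to the source material
    "report_facts": 0.0,      # structured, consistent reporting
    "brainstorm": 0.7,        # open-ended suggestions
    "recommend": 0.7,         # engaging, varied insights
}

def temperature_for(task: str) -> float:
    # Default to a cautious low value for unknown task types.
    return TASK_TEMPERATURE.get(task, 0.2)

# Usage inside an agent loop, e.g.:
# client.chat.completions.create(..., temperature=temperature_for("brainstorm"))
```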

             

We have seen the most important LLM settings, or at least the ones we can access as users; professionals have access to even more. We defined each of them, then focused on the one we probably all use, temperature, and saw that you can adapt it to the kind of LLM call you are making.

             

Nothing beats hands-on practice to get a feel for what really happens when you change these values. Many online playgrounds let you experiment, and the proprietary APIs give more details in their documentation.
Chikara Houses has other articles about AI and IT in general; some are a collab with the @Creditizens YouTube channel and include videos and example code snippets to give you an idea. Continue Reading

             

             

             


             

#adjustmaxtokensinAI #AImodeltemperaturesetting #AIsettingsforaccuracy #customizeAImodel #frequencypenaltyAI

