Ingesting Data into Ask Sage

This section guides you through ingesting data into a dataset within Ask Sage. This is essential for generating precise results tailored to your specific needs.

GenAI models use a wide range of open-source information. By incorporating targeted and relevant data, you can significantly enhance the quality and relevance of your outcomes. Ask Sage also facilitates dataset sharing across an organization, promoting collaboration and knowledge sharing.

1) Ask Sage allows you to ingest data in any format, including text, images, and audio, to generate text tailored to various use cases.

2) Ask Sage users benefit from ingesting data only once, enabling its use across multiple GenAI models on the platform.

Table of contents

Understanding Datasets, Tokens, and Embeddings
Steps to Ingest Data into Ask Sage
How RAG Works
Example
Summary

Understanding Datasets, Tokens, and Embeddings

Purpose of Ingesting Data into an Ask Sage Dataset

Ingesting data into an Ask Sage dataset allows users to merge their prompts with the ingested information, enabling the generation of customized text. This approach is particularly advantageous for organizations seeking tailored responses based on their unique data and knowledge. This method, known as Retrieval Augmented Generation (RAG), enhances the capabilities of Generative AI models by incorporating external information, leading to more accurate and contextually relevant outputs.

For a deeper understanding of RAG, see the How RAG Works section near the bottom of this page, which explains how RAG functions and how it is applied within Ask Sage.

Ask Sage Training Tokens

When users ingest data into Ask Sage, the platform utilizes training tokens to convert that data into embeddings, which are then stored in an Ask Sage dataset. Think of training tokens as a form of currency that allows you to input data into the platform. Each month, users receive a specific number of training tokens based on their subscription plan.

To view your available tokens, navigate to Settings and click the Tokens tab. There you will find the counts for both Inference Tokens and Training Tokens.

Inference Tokens are consumed when generating text using the GenAI models on Ask Sage.
Training Tokens are required when ingesting data into a dataset on Ask Sage.

Please note that tokens reset at the beginning of each month and do not carry over to the next month.

What is a Token?

A token is a unit of text that the model processes, which can range from a single character to a whole word. For instance, the word “hello” is typically a single token, while the phrase “I love programming!” is split into several tokens, such as “I”, “love”, “programming”, and “!”. The exact split, including how spaces are handled, depends on the tokenizer the model uses. When you ingest data into Ask Sage, the platform uses tokens to represent the text, converting it into a format that the model can understand. The more training tokens you have, the more data you can ingest into an Ask Sage dataset.
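As a rough illustration of splitting text into pieces, the sketch below uses a regular expression to separate words from punctuation. It is not the tokenizer used by Ask Sage or by any GenAI model: real models use subword tokenizers, and this simplified splitter drops whitespace, so its counts can differ from a real tokenizer's. The function name `rough_tokenize` is hypothetical.

```python
import re


def rough_tokenize(text: str) -> list[str]:
    """Split text into words and single punctuation marks (illustration only)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = rough_tokenize("I love programming!")
print(tokens)       # ['I', 'love', 'programming', '!']
print(len(tokens))  # 4
```

Longer documents produce more tokens, which is why ingesting larger files consumes more training tokens.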

What is an Embedding?

An embedding is a numerical representation of data that captures its meaning in a way that a model can interpret. Essentially, embeddings transform complex data—such as text or images—into a format that allows algorithms to analyze it effectively.

Tokens and embeddings are closely related: once the text is tokenized, each token is mapped to a corresponding embedding. This mapping allows the model to understand the relationships and meanings of the tokens in a more nuanced way, enabling it to generate relevant responses based on the ingested data.
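To make the idea of "capturing meaning numerically" concrete, here is a minimal sketch that compares embeddings with cosine similarity, a common way to measure how close two vectors are. The three-dimensional vectors are hand-picked stand-ins, not output from any Ask Sage model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: related words are given nearby vectors.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

# Related words score close to 1; unrelated words score close to 0.
print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))
print(cosine_similarity(embeddings["cat"], embeddings["invoice"]))
```

In this toy data, "cat" and "kitten" score much higher than "cat" and "invoice", which is the property a similarity search over embeddings relies on.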

Steps to Ingest Data into Ask Sage

Define a Dataset

The first step is to create a dataset in Ask Sage. A dataset is equivalent to a folder where you can store all the data you want to ingest into Ask Sage. You can create multiple datasets to organize your data based on specific use cases or projects.

To create a dataset, follow these steps:

Click the Data button, then select the Upload New Files button.
Click the Create New Dataset button.
- Enter a dataset name. Only alphanumeric characters and hyphens are allowed; spaces and other special characters are not (e.g., my-dataset12345).
- Classify the dataset as Unclassified or CUI (Controlled Unclassified Information).
- Click the Create Dataset button. (If successful, you will see the message Dataset created.)
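If you generate dataset names in a script, the naming rule above can be checked before you type the name into Ask Sage. This is a minimal sketch of the stated rule only (alphanumeric characters and hyphens); whether Ask Sage applies further limits, such as a maximum length, is not stated here, and the function name `is_valid_dataset_name` is hypothetical.

```python
import re

# One or more letters, digits, or hyphens, and nothing else.
DATASET_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Check a name against the 'alphanumeric and hyphens only' rule."""
    return DATASET_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_dataset_name("my-dataset12345"))  # True
print(is_valid_dataset_name("my dataset"))       # False (contains a space)
print(is_valid_dataset_name("my_dataset"))       # False (underscore not allowed)
```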

Users with a CAC/PIV card can label datasets as either CUI or Unclassified. In contrast, users without a CAC/PIV card are limited to labeling datasets as Unclassified. If you do not possess a CAC/PIV card but need to label a dataset as CUI, please contact Support at support@asksage.ai for assistance.

After creating a dataset, you can now start ingesting data into Ask Sage.

1) As a best practice, use a clear naming convention for your datasets so you can easily identify them when ingesting data.

2) On your local machine, you can create a folder with the same name as the dataset you created in Ask Sage. This will help you organize your data locally and easily upload it to Ask Sage.

Upload/Ingest Data

Please refrain from ingesting data into Ask Sage workbooks through the dataset management page. Workbooks should only be managed via the workbook user interface. Attempting to ingest data through the dataset management page will not be successful.

After creating a dataset, you can upload/ingest data into Ask Sage. A wide range of formats is supported, as listed in the table below:

Data Type	File Format	Example	Max Size Per File
Text	.txt, .docx, .pdf, .pptx, .ppt, .csv, .cc, .sql, .cs, .hh, .c, .php, .js, .py, .html, .xml, .msg, .odt, .epub, .eml, .rtf, .doc, .json, .md, .tsv, .yaml, .yml, .java, .rb, .sh, .bat, .ps1	`example.txt`	50MB
Image	.jpg, .jpeg, .png	`example.jpg`	50MB
Audio	.wav, .mp3, .mp4, .mpeg, .mpga, .m4a, .webm	`example.wav`	500MB
Compressed	.zip	`example.zip`	50MB
Spreadsheet	.xlsx, .tsv	`example.xlsx`	50MB
Presentation	.pptx, .ppt	`example.pptx`	50MB
Code	.cc, .sql, .cs, .hh, .c, .php, .js, .py, .java, .rb, .sh, .bat, .ps1	`example.py`	50MB
E-book	.epub	`example.epub`	50MB
Email	.eml, .msg	`example.eml`	50MB
Rich Text	.rtf	`example.rtf`	50MB
Markup	.md, .html, .xml	`example.html`	50MB
Data Interchange	.json, .yaml, .yml	`example.json`	50MB

Be aware that images in text file documents will not be extracted. You will need to upload the images separately.

Ask Sage can ingest other file types as well. If you have specific requirements, please reach out to the Ask Sage team for assistance.
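Before uploading a large batch, it can help to pre-check files locally against the table above. The sketch below encodes the listed extensions and per-file size limits; it is not part of Ask Sage. The names `max_size_bytes` and `can_ingest` are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption. File types outside the table are reported as not ingestible even though Ask Sage may accept them on request.

```python
from pathlib import Path
from typing import Optional

MB = 1024 * 1024  # assumption: the table's "MB" means 1024 * 1024 bytes

# Extensions taken from the table; audio has the larger 500MB limit.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".webm"}
OTHER_EXTENSIONS = {
    ".txt", ".docx", ".pdf", ".pptx", ".ppt", ".csv", ".cc", ".sql", ".cs",
    ".hh", ".c", ".php", ".js", ".py", ".html", ".xml", ".msg", ".odt",
    ".epub", ".eml", ".rtf", ".doc", ".json", ".md", ".tsv", ".yaml", ".yml",
    ".java", ".rb", ".sh", ".bat", ".ps1", ".jpg", ".jpeg", ".png", ".zip",
    ".xlsx",
}


def max_size_bytes(filename: str) -> Optional[int]:
    """Return the per-file size limit for a listed extension, else None."""
    ext = Path(filename).suffix.lower()
    if ext in AUDIO_EXTENSIONS:
        return 500 * MB
    if ext in OTHER_EXTENSIONS:
        return 50 * MB
    return None


def can_ingest(filename: str, size_bytes: int) -> bool:
    """True if the extension is in the table and the size is within its limit."""
    limit = max_size_bytes(filename)
    return limit is not None and size_bytes <= limit


print(can_ingest("example.wav", 400 * MB))  # True  (audio limit is 500MB)
print(can_ingest("example.txt", 60 * MB))   # False (over the 50MB limit)
```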

To upload data into Ask Sage, navigate to the Ingest Files section and follow these steps:

Select the dataset you created from the dropdown list.
Drag and drop the files you want to upload into the designated box, or click inside the box to choose files from your local machine.
Once the files are selected, their names will appear in the box. Review the list to ensure accuracy, and use the garbage bin icon to remove any files you do not wish to upload.
Click the Ingest Files button to begin the upload process.
If the upload is successful, a white checkmark and the message Successfully Imported will appear next to each uploaded file.

The purpose of an Ask Sage dataset is to store embeddings rather than the original files. Consequently, the original files will not be included in the dataset. Instead, embeddings will be stored in a vector database optimized for rapid retrieval and search. This design enables the use of Ask Sage datasets with Retrieval Augmented Generation (RAG) to generate text based on the ingested data.

Vector embedding databases, like those provided by Ask Sage, are not intended for ingesting large tabular data, such as spreadsheets. If you have tabular data and wish to utilize GenAI for analysis, you can simply attach the spreadsheet to your prompt. Ask Sage will then use Python libraries to analyze the data and generate text based on that analysis. This approach allows you to effectively leverage GenAI for data analysis without the need to ingest the data into a dataset.

Using the Dataset with Any Ask Sage Models

After ingesting data into an Ask Sage Dataset, you can now use the vector dataset with any of the GenAI models available on the platform.

To use the data with the GenAI models, follow these steps:

Click the Data button.
Select the dataset(s) you want to reference/use when prompting questions.
- Note: You can select multiple datasets.

Update any other settings as needed (e.g., Model, Persona, Temperature, etc.)
Enter a prompt and submit it.

Here is an example of when a dataset is selected and used within Ask Sage:

The selected dataset(s) appear both when you click the Data button and under the prompt window, so you can easily identify which dataset(s) are used with the prompt.

For optimal results with the ingested data, we recommend keeping the Temperature setting at its default value of 0.0 and ensuring that the Live setting is turned off. Incorrect settings may result in subpar outcomes or data contamination.

The Live feature is not CUI compliant and cannot be used with CUI-labeled datasets.

The inference/response generated by the GenAI model utilizes the dataset assigned to the prompt.

The Show Explainability feature provides a detailed reference to the data used to generate the text when a dataset and/or the Live feature is in use. This helps you understand the context of the generated text and confirm that it is grounded in your data rather than a hallucination.

Lastly, if you do not need to use an Ask Sage dataset, select None from the dropdown list. This lets you use the GenAI models without any dataset reference. It also saves inference tokens, as the model will not need to reference any dataset.

RAG Flow Chart

How RAG Works

Retrieval Augmented Generation (RAG) enhances the capabilities of Generative AI models by incorporating external information. It adds two steps before the model generates its response:

Retrieve Relevant Context:
- The model identifies and retrieves information from a database or knowledge base that is pertinent to the user’s query or task. This context provides essential background or specific details that may not be present in the model’s training data.
Augment Prompt with Retrieved Data:
- The retrieved information is combined with the original user prompt. This enriched prompt is then fed into the Generative AI model, enabling it to generate responses that are more accurate, contextually relevant, and informative.
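The two steps above can be sketched in a few lines. This is a toy illustration, not how Ask Sage implements RAG: for brevity it ranks passages by the number of words they share with the query instead of comparing embeddings in a vector database, and the function names (`words`, `retrieve`, `augment`) and the sample passages are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Step 1: return the top_k passages sharing the most words with the query."""
    query_words = words(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_words & words(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Step 2: combine the retrieved passages with the original prompt."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}"


passages = [
    "Use HTTPS to encrypt data in transit for a web application.",
    "Implement input validation to prevent injection attacks in a web application.",
    "Quarterly travel reimbursements are due on the fifth business day.",
]
query = "What are the best practices for securing a web application?"

# The augmented prompt is what would be sent to the GenAI model.
print(augment(query, retrieve(query, passages)))
```

With these sample passages, the two security-related passages are retrieved and the unrelated reimbursement passage is left out of the prompt.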

Example

User Prompt: “What are the best practices for securing a web application?”

Retrieve Relevant Context:
- The RAG system searches a database and finds relevant articles, guidelines, and facts about web application security, such as:
  - Use HTTPS to encrypt data in transit
  - Implement input validation to prevent injection attacks
  - Regularly update software dependencies to patch vulnerabilities
Augment Prompt:
- The system combines the original prompt with the retrieved information:
- Augmented Prompt: “What are the best practices for securing a web application? Use HTTPS to encrypt data in transit, implement input validation to prevent injection attacks, and regularly update software dependencies to patch vulnerabilities.”
Generate Response:
- The Generative AI model processes the augmented prompt and generates a more informed response:
- “Securing a web application involves several best practices. First, always use HTTPS to encrypt data in transit, ensuring that sensitive information is protected from eavesdroppers. Second, implement robust input validation to prevent injection attacks, such as SQL injection or cross-site scripting (XSS). Lastly, regularly update your software dependencies to patch known vulnerabilities and reduce the risk of exploitation.”

Summary

In this section, we guided you through the process of ingesting data into Ask Sage. Understanding this process is crucial to generating accurate results that are relevant to your work or organization.

Now that you have a better understanding of how to ingest data into Ask Sage, you are ready to start utilizing the platform and leveraging the power of GenAI!

Proceed to the next sections to learn more about Ask Sage! 🚀

Dataset Management