Data Ingestion Guide

Ingesting Data into Ask Sage

Transform your data into powerful AI insights with Ask Sage datasets. Ingest once, use everywhere across all GenAI models.

Data being ingested into Ask Sage platform
Key Benefits:
  • Ingest data in any format—text, images, audio—to generate tailored responses
  • Upload once, use across multiple GenAI models on the platform
  • Share datasets organization-wide for seamless collaboration
Table of Contents
  1. Understanding Datasets, Tokens, and Embeddings
  2. Steps to Ingest Data into Ask Sage
  3. Using the Dataset with Ask Sage Models
  4. How RAG Works

Understanding Datasets, Tokens, and Embeddings


Purpose of Ingesting Data into an Ask Sage Dataset

Ingesting data into an Ask Sage dataset allows users to merge their prompts with the ingested information, enabling the generation of customized text. This approach is particularly advantageous for organizations seeking tailored responses based on their unique data and knowledge. This method, known as Retrieval Augmented Generation (RAG), enhances the capabilities of Generative AI models by incorporating external information, leading to more accurate and contextually relevant outputs.

Learn More: For a deeper understanding of RAG, scroll to the bottom of this page where we explore how RAG functions and its applications within Ask Sage.

Ask Sage Training Tokens

When users ingest data into Ask Sage, the platform utilizes training tokens to convert that data into embeddings, which are then stored in an Ask Sage dataset. Think of training tokens as a form of currency that allows you to input data into the platform. Each month, users receive a specific number of training tokens based on their subscription plan.

To view your available tokens, navigate to Settings and click the Tokens tab. There you will find the counts for both Inference Tokens and Training Tokens.

Token Overview showing inference and training token counts

Inference Tokens

Consumed when generating text using GenAI models on Ask Sage

Training Tokens

Required when ingesting data into a dataset on Ask Sage

Important: Tokens reset at the beginning of each month and do not carry over to the next month.

What is a Token?

A token is a unit of text that the model processes; it can be as small as a single character or as large as a whole word. For instance, a short common word such as "hello" is typically a single token, while the phrase "I love programming!" splits into several tokens, such as "I", "love", "programming", and "!". The exact boundaries, including how spaces are handled, depend on the tokenizer. When you ingest data into Ask Sage, the platform uses tokens to represent the text, converting it into a format that the model can understand. The more training tokens you have, the more data you can ingest into an Ask Sage dataset.
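To make the idea of counting tokens concrete, here is a minimal sketch in Python. It is not the tokenizer Ask Sage uses: `naive_tokens` is a hypothetical word-and-punctuation splitter, and `estimate_tokens` relies on the common rule of thumb of roughly four characters per token for English text. Treat both as rough planning aids, not as a prediction of how many training tokens an upload will consume.

```python
import math
import re

# Assumption: ~4 characters per token is a rule of thumb for English text.
# The tokenizer Ask Sage actually uses may count differently.
CHARS_PER_TOKEN = 4


def naive_tokens(text: str) -> list[str]:
    """Split text into words and punctuation marks (illustrative only)."""
    return re.findall(r"\w+|[^\w\s]", text)


def estimate_tokens(text: str) -> int:
    """Rough token estimate derived from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


sample = "I love programming!"
print(naive_tokens(sample))     # word-level pieces of the sample phrase
print(estimate_tokens(sample))  # rough estimate from length alone
```

Real subword tokenizers often split long or rare words into several pieces, so the two numbers above will not necessarily match each other or the platform's count.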

What is an Embedding?

An embedding is a numerical representation of data that captures its meaning in a way that a model can interpret. Essentially, embeddings transform complex data—such as text or images—into a format that allows algorithms to analyze it effectively.

Tokens and embeddings are closely related: once the text is tokenized, each token is mapped to a corresponding embedding. This mapping allows the model to understand the relationships and meanings of the tokens in a more nuanced way, enabling it to generate relevant responses based on the ingested data.
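As an illustration of how embeddings capture meaning numerically, the sketch below compares hand-made three-dimensional vectors using cosine similarity, a common way to measure how close two embeddings are. The vectors and their values are invented for this example; real embeddings are produced by a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for illustration only.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

related = cosine_similarity(embeddings["cat"], embeddings["kitten"])
unrelated = cosine_similarity(embeddings["cat"], embeddings["invoice"])
print(f"cat vs kitten:  {related:.3f}")
print(f"cat vs invoice: {unrelated:.3f}")
```

Vectors that point in similar directions score close to 1, which is how a search over embeddings can find text related to a prompt even when the wording differs.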


Steps to Ingest Data into Ask Sage


Define a Dataset

The first step is to create a dataset in Ask Sage. A dataset is equivalent to a folder where you can store all the data you want to ingest into Ask Sage. You can create multiple datasets to organize your data based on specific use cases or projects.

To create a dataset, follow these steps:

Ask Sage interface showing Prompt Tools, Data & Settings, and Upload New Files buttons
  • Click the Prompt Tools button, then select the Data & Settings button. After that, choose the Upload New Files button.
Quick Access: To quickly access the datasets, click on the Folder Icon located below the prompt window.
Folder icon for quick dataset access
Create New Dataset dialog window

Create New Dataset

  • Click on the Create New Dataset button.
    • Enter a dataset name. Only alphanumeric characters and hyphens are allowed. No spaces or special characters are allowed (e.g., my-dataset12345).
    • Classify the dataset as Unclassified or CUI (Controlled Unclassified Information).
    • Click on the Create Dataset button. (If successful, you will see Dataset created)
CAC/PIV Card Access: Users with a CAC/PIV card can label datasets as either CUI or Unclassified. Users without a CAC/PIV card are limited to labeling datasets as Unclassified. If you need to label a dataset as CUI but do not possess a CAC/PIV card, please contact Support at support@asksage.ai for assistance.
Dataset creation success confirmation
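
The naming rule from the dataset-creation steps (alphanumeric characters and hyphens only) can be checked locally before you type a name into the dialog. The following is a minimal sketch: the function name is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits, which the platform may or may not enforce in exactly this way.

```python
import re

# Assumption: "alphanumeric" is read as ASCII letters and digits.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name uses only letters, digits, and hyphens."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["my-dataset12345", "my dataset", "my_dataset"]:
    print(candidate, is_valid_dataset_name(candidate))
```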

After creating a dataset, you can now start ingesting data into Ask Sage.

Best Practices:
  • Use a clear naming convention for your datasets to easily identify them when ingesting data
  • On your local machine, create a folder with the same name as the dataset you created in Ask Sage to help organize your data locally and easily upload it

Upload/Ingest Data

Warning: Please refrain from ingesting data into Ask Sage workbooks through the dataset management page. Workbooks should only be managed via the workbook user interface. Attempting to ingest data through the dataset management page will not be successful.

Supported File Types

After creating a dataset, you can upload and ingest data into Ask Sage. A wide range of formats is supported, as listed in the table below:

| Data Type | File Format | Example | Max Size Per File |
|---|---|---|---|
| Text | .txt, .docx, .pdf, .pptx, .ppt, .csv, .cc, .sql, .cs, .hh, .c, .php, .js, .py, .html, .xml, .msg, .odt, .epub, .eml, .rtf, .doc, .json, .md, .tsv, .yaml, .yml, .java, .rb, .sh, .bat, .ps1 | example.txt | 50MB |
| Image | .jpg, .jpeg, .png | example.jpg | 50MB |
| Audio | .wav, .mp3, .mp4, .mpeg, .mpga, .m4a, .webm | example.wav | 500MB |
| Compressed | .zip | example.zip | 50MB |
| Spreadsheet | .xlsx, .tsv | example.xlsx | 50MB |
| Presentation | .pptx, .ppt | example.pptx | 50MB |
| Code | .cc, .sql, .cs, .hh, .c, .php, .js, .py, .java, .rb, .sh, .bat, .ps1 | example.py | 50MB |
| E-book | .epub | example.epub | 50MB |
| Email | .eml, .msg | example.eml | 50MB |
| Rich Text | .rtf | example.rtf | 50MB |
| Markup | .md, .html, .xml | example.html | 50MB |
| Data Interchange | .json, .yaml, .yml | example.json | 50MB |
Note: Be aware that images in text file documents will not be extracted. You will need to upload the images separately.
Additional File Types: Ask Sage is capable of ingesting other file types as well. If you have any specific requirements, please reach out to the Ask Sage team for assistance.
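
Before uploading a large batch, it can help to screen files locally against the extensions and size limits listed in the supported file types table. The sketch below is one way to do that. The function name is hypothetical, and it assumes 1 MB means 1,048,576 bytes, which this guide does not specify. It takes a file name and a size in bytes, so it can be used with or without touching the disk.

```python
from pathlib import PurePath

MB = 1024 * 1024  # Assumption: the guide does not say whether MB is 10^6 or 2^20 bytes.

# Extensions copied from the supported file types table.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".webm"}
OTHER_EXTENSIONS = {
    ".txt", ".docx", ".pdf", ".pptx", ".ppt", ".csv", ".cc", ".sql", ".cs",
    ".hh", ".c", ".php", ".js", ".py", ".html", ".xml", ".msg", ".odt",
    ".epub", ".eml", ".rtf", ".doc", ".json", ".md", ".tsv", ".yaml", ".yml",
    ".java", ".rb", ".sh", ".bat", ".ps1", ".jpg", ".jpeg", ".png", ".zip",
    ".xlsx",
}


def check_file(name: str, size_bytes: int) -> str:
    """Classify a file as 'ok', 'too large', or 'unsupported type'."""
    extension = PurePath(name).suffix.lower()
    if extension in AUDIO_EXTENSIONS:
        limit = 500 * MB  # audio files have the larger limit
    elif extension in OTHER_EXTENSIONS:
        limit = 50 * MB
    else:
        return "unsupported type"
    return "ok" if size_bytes <= limit else "too large"


print(check_file("notes.txt", 1_000))
print(check_file("scan.png", 60 * MB))
```

For files on disk, `pathlib.Path(path).stat().st_size` supplies the size in bytes. A file reported as "unsupported type" here may still be ingestible, since the guide notes that other file types can be handled on request.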

Upload Steps

To upload data into Ask Sage, navigate to the Ingest Files section and follow these steps:

  1. Select the dataset you created from the dropdown list.
  2. Drag and drop the files you want to upload into the designated box, or click inside the box to choose files from your local machine.
  3. Once the files are selected, their names will appear in the box. Review the list to ensure accuracy, and use the garbage bin icon to remove any files you do not wish to upload.
  4. Click the Ingest Files button to begin the upload process.
  5. If the upload is successful, a white checkmark and the message Successfully Imported will appear next to each uploaded file.
Successful file ingestion with checkmarks and success messages
Vector Embeddings: The purpose of an Ask Sage dataset is to store embeddings rather than the original files. Consequently, the original files will not be included in the dataset. Instead, embeddings will be stored in a vector database optimized for rapid retrieval and search. This design enables the use of Ask Sage datasets with Retrieval Augmented Generation (RAG) to generate text based on the ingested data.
Tip: Vector embedding databases, like those provided by Ask Sage, are not intended for ingesting large tabular data, such as spreadsheets. If you have tabular data and wish to utilize GenAI for analysis, you can simply attach the spreadsheet to your prompt. Ask Sage will then use Python libraries to analyze the data and generate text based on that analysis. This approach allows you to effectively leverage GenAI for data analysis without the need to ingest the data into a dataset.
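
To give a sense of what Python-based analysis of tabular data looks like, compared with a similarity search over embeddings, here is a small sketch using only the standard library. It is not the code Ask Sage runs when a spreadsheet is attached to a prompt; the CSV content, column names, and function name are all invented for illustration.

```python
import csv
import io
import statistics

# Hypothetical spreadsheet content, exported as CSV for this example.
CSV_TEXT = """region,units_sold
north,120
south,95
east,140
west,105
"""


def summarize_column(csv_text: str, column: str) -> dict:
    """Compute simple statistics for one numeric column of a CSV document."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


summary = summarize_column(CSV_TEXT, "units_sold")
print(summary)
```

Exact aggregates like these come from computing over every row, which is why attaching the spreadsheet suits tabular questions better than retrieving a few similar text fragments from a vector database.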

Using the Dataset with Ask Sage Models


After ingesting data into an Ask Sage Dataset, you can now use the vector dataset with any of the GenAI models available on the platform.

To use the data with the GenAI models, follow these steps:

  • Click the Data button.
  • Select the dataset(s) you want to reference/use when prompting questions.
    • Note: You can select multiple datasets.
Dataset selection interface showing available datasets
  • Update any other settings as needed (e.g., Model, Persona, Temperature, etc.)
  • Enter a prompt and submit it.

Here is an example of when a dataset is selected and used within Ask Sage:

Ask Sage interface showing dataset being used in a query

The selected dataset(s) appear both in the Data menu and under the prompt window, so users can easily identify which dataset(s) a prompt uses.

Best Practice: For optimal results with the ingested data, we recommend keeping the Temperature setting at its default value of 0.0 and ensuring that the Live setting is turned off. Incorrect settings may result in subpar outcomes or data contamination.
Warning: The Live feature is not CUI compliant and cannot be used with CUI labeled datasets.
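
For background on the Temperature recommendation, the sketch below shows how sampling temperature is commonly applied in language models: candidate scores are divided by the temperature before being converted to probabilities, and a temperature of 0 is treated as always picking the top-scoring candidate. This is a general illustration with made-up scores and a hypothetical function name. How Ask Sage and each hosted model handle the setting internally is not described in this guide.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Greedy choice: all probability goes to the highest-scoring candidate.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])
```

Lower temperatures concentrate probability on the most likely candidate, which tends to make answers grounded in a dataset more repeatable.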
Ask Sage explainability feature showing data sources

The inference/response generated by the GenAI model utilizes the dataset assigned to the prompt.

Explainability Feature: The Show Explainability feature gives users a detailed reference to the data used to generate the text when a dataset and/or the Live feature is in use. This helps users understand the context of the generated text and confirm that it is relevant and not a hallucination.
Tip: If you do not need to use an Ask Sage dataset, select None from the dropdown list. This will allow you to use the GenAI models without any dataset reference. This will also save inference tokens, as the model will not need to reference any dataset.

How RAG Works


RAG Flow Chart showing the retrieval and augmentation process

Retrieval Augmented Generation (RAG) enhances the capabilities of Generative AI models by incorporating external information. It adds two steps before the model generates its response:

1. Retrieve Relevant Context

The system identifies and retrieves information from a database or knowledge base that is pertinent to the user's query or task. This context provides essential background or specific details that may not be present in the model's training data.

2. Augment Prompt with Retrieved Data

The retrieved information is combined with the original user prompt. This enriched prompt is then fed into the Generative AI model, enabling it to generate responses that are more accurate, contextually relevant, and informative.

Example

User Prompt: "What are the best practices for securing a web application?"

  1. Retrieve Relevant Context:
    • The RAG system searches a database and finds relevant articles, guidelines, and facts about web application security, such as:
      • Use HTTPS to encrypt data in transit
      • Implement input validation to prevent injection attacks
      • Regularly update software dependencies to patch vulnerabilities
  2. Augment Prompt:
    • The system combines the original prompt with the retrieved information:
    • Augmented Prompt: "What are the best practices for securing a web application? Use HTTPS to encrypt data in transit, implement input validation to prevent injection attacks, and regularly update software dependencies to patch vulnerabilities."
  3. Generate Response:
    • The Generative AI model processes the augmented prompt and generates a more informed response:
    • "Securing a web application involves several best practices. First, always use HTTPS to encrypt data in transit, ensuring that sensitive information is protected from eavesdroppers. Second, implement robust input validation to prevent injection attacks, such as SQL injection or cross-site scripting (XSS). Lastly, regularly update your software dependencies to patch known vulnerabilities and reduce the risk of exploitation."
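
The retrieve and augment steps in the example above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Ask Sage's implementation: it scores documents with a simple bag-of-words cosine similarity instead of model-generated embeddings, the knowledge base and function names are invented, and the final generation step is left out because it requires a call to a GenAI model.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base standing in for an ingested dataset.
KNOWLEDGE_BASE = [
    "Use HTTPS to encrypt data in transit.",
    "Implement input validation to prevent injection attacks.",
    "Regularly update software dependencies to patch vulnerabilities.",
    "Water the office plants every Monday.",
]


def bag_of_words(text: str) -> Counter:
    """Count lowercase words; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Step 1: return the top_k documents most similar to the query."""
    query_vector = bag_of_words(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Step 2: combine the user prompt with the retrieved context."""
    lines = "\n".join(f"- {item}" for item in context)
    return f"{query}\n\nContext:\n{lines}"


query = "How do I prevent injection attacks in a web application?"
prompt = augment(query, retrieve(query, KNOWLEDGE_BASE))
print(prompt)  # this augmented prompt is what would be sent to the model
```

Word overlap is used here only to keep the example self-contained. Embedding-based retrieval serves the same purpose but can also match passages that share meaning without sharing exact words.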


Copyright © 2026 Ask Sage Inc. All Rights Reserved. Ask Sage is a BigBear.ai company.