Organize, ingest, and leverage your organization's content with vector-powered datasets. Ground AI responses in your specific sources for accurate, contextually relevant results.
Ask Sage Datasets are organized collections of your organization's content, including text, images, and audio, that you ingest into the platform to ground AI-generated responses in your specific sources. Datasets enable you to ingest data once and reuse it across multiple prompts, models, and team members, ensuring consistent, accurate, and contextually relevant results.
Accuracy & Relevance
Generate responses grounded in your specific materials rather than generic web content
Efficiency & Reuse
Ingest content once and reuse it across different prompts, models, and use cases
Team Collaboration
Share datasets to establish a single source of truth across your organization
Advanced Search
Use the Search Datasets plugin to quickly locate specific facts and information
Model Flexibility
Use datasets with any GenAI model; you are never locked into a single provider
CUI Support
Classify datasets as Unclassified or CUI to handle controlled information
Getting Started
Selecting Datasets in Prompt Settings
1. Access Data & Settings
Navigate to Data & Settings to select datasets that will provide context for your prompt:
Click the Data & Settings button or the Folder Icon below the prompt window
Select the dataset(s) you want to reference
Choose multiple datasets or select None if no dataset context is needed
Selected datasets will appear under the prompt window for easy identification
Understanding Attachments vs. Datasets
Chat Attachments
Persistence: One-time use only
Scope: Single conversation
Sharing: Not shareable
Use Case: Quick, ad-hoc analysis
Token Type: Inference tokens
vs
Datasets
Persistence: Permanent storage for reuse
Scope: Available across all prompts
Sharing: Can be shared with your team
Use Case: Recurring reference material
Token Type: Training + Inference tokens
Important: Files you attach in a chat are for one-off use and are not automatically saved to a dataset unless you explicitly ingest them.
Creating and Ingesting Datasets
2. Create a New Dataset
Click Prompt Tools → Data & Settings → Upload New Files
Click Create New Dataset
Enter a dataset name (alphanumeric characters and hyphens only, e.g., my-dataset-2025)
Classify the dataset as Unclassified or CUI (Controlled Unclassified Information)
Click Create Dataset
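The naming rule above (alphanumeric characters and hyphens only) can be checked before you create a dataset. A minimal sketch; the helper name and regex are illustrative, not part of Ask Sage:

```python
import re

# Mirrors the documented rule: names may contain only letters,
# digits, and hyphens (e.g., my-dataset-2025).
DATASET_NAME_RE = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` uses only alphanumerics and hyphens."""
    return bool(DATASET_NAME_RE.match(name))
```

Spaces, underscores, and other punctuation would be rejected under this rule.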
CUI Classification: Classifying a dataset as CUI requires a CAC/PIV card or special activation. Contact support@asksage.ai to request this feature, or reach out to your organization's Administrator.
3. Upload Files to Dataset
Select your dataset from the dropdown list
Drag and drop files into the designated box, or click to browse your local machine
Review the file list and remove any unwanted files using the garbage bin icon
Click Ingest Files to begin the upload process
Look for the white checkmark and "Successfully Imported" message for each file
Supported File Formats (not exhaustive; additional formats are supported)
Documents: .pdf, .doc, .docx
Spreadsheets: .xls, .xlsx, .csv
Presentations: .ppt, .pptx
Images: .jpg, .png, .gif, .svg
Text Files: .txt, .rtf, .log
Markup & Data: .html, .xml, .json, .yaml
Maximum File Size: 50MB per file
Image Handling: Images embedded in text documents will not be extracted automatically; upload them separately. If you don't see a file type you need supported, email support@asksage.ai.
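A quick pre-upload check against the limits above can save a failed ingestion. This is an illustrative sketch, not an official API; the extension list is the subset of formats named in this section:

```python
import os

MAX_FILE_SIZE = 50 * 1024 * 1024  # documented limit: 50MB per file
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv", ".ppt", ".pptx",
    ".jpg", ".png", ".gif", ".svg", ".txt", ".rtf", ".log",
    ".html", ".xml", ".json", ".yaml",
}

def check_file(path: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks ingestible."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        problems.append(f"file exceeds 50MB limit ({size_bytes} bytes)")
    return problems
```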
Managing and Sharing Datasets
Share Datasets
Enter teammates' email addresses and confirm to grant access. Sharing ensures your team works from the same source of truth.
Delete Datasets
Remove datasets you no longer need to maintain organization and optimize token usage.
Copy Datasets
Duplicate dataset files for different use cases or teams without re-ingesting content.
View Details
See ingested files, file counts, and dataset metadata at a glance.
Understanding Tokens and Usage
Training Tokens
Purpose: Ingest data into datasets
When Consumed: When uploading and processing files
Use Case: Converting content into vector embeddings
Inference Tokens
Purpose: Generate AI responses
When Consumed: When submitting prompts and generating text
Use Case: Querying datasets and producing content
Monitoring Token Usage
Navigate to Settings → Tokens to view:
Current subscription plan
Inference tokens used and remaining
Training tokens used and remaining
Monthly reset date
Important: Tokens reset on the first day of each month and do not roll over. Plan your usage accordingly to maximize value.
Best Practices for Dataset Creation
Avoid Bloated Datasets
Why It Matters
Overloading datasets with too many files can confuse models. RAG is limited by the model's context window, so only a finite amount of information can be processed per prompt.
Keep datasets focused and purposeful
Only include files directly relevant to the dataset's intended use case
Regularly review and remove outdated or unnecessary content
Consider context window limitations when determining dataset size
Maintain Specificity
Why It Matters
Mixing unrelated data can confuse the model and lead to irrelevant or incorrect responses.
Bad Practice
Creating a dataset with both tank manuals and airplane manuals
Problem: If you ask "Tell me how to fix Model XY," and both a tank and a plane share that model number, the AI might pull information about the wrong vehicle.
Good Practice
Create separate datasets: tank-manuals-2025 and aircraft-manuals-2025
Create separate datasets for different product lines, departments, or subject areas
Use clear, descriptive dataset names that reflect specific content
Removing unnecessary content ensures the model focuses on relevant information and improves retrieval accuracy.
Preprocessing Steps
Remove cover pages, table of contents, appendices, or sections without substantive information
Clean up formatting issues that might interfere with text extraction
Ensure images are clear and properly labeled if they contain important information
Verify text is machine-readable (avoid scanned documents with poor OCR quality)
For Large Documents
Split large files into smaller chunks (e.g., a 100-page document → five 20-page sections)
Use the Summarization Plugin to condense content before ingestion
For technical manuals, create separate datasets for different sections
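The splitting advice above can be sketched as a simple character-based chunker. Character counts are only a rough proxy for tokens (roughly 4 characters per token for English text), and all names here are illustrative:

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks for separate ingestion.

    A small overlap preserves context across chunk boundaries; tune
    max_chars to your model's context window.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so chunks share some context
    return chunks
```

In practice you would split on section or paragraph boundaries rather than raw character offsets, but the overlap idea carries over.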
Prioritize Quality Over Quantity
Quality Guidelines
Prioritize high-quality, authoritative sources over volume
Verify documents are current and accurate before ingestion
Remove duplicate or redundant information
Ensure text is machine-readable and well-formatted
Maintain Dataset Hygiene
Ongoing Maintenance
Regularly audit datasets for outdated information
Update datasets when source materials change
Document what each dataset contains and its intended use case
Use consistent naming conventions across your organization
Archive or delete datasets that are no longer needed
Understanding RAG Technology
What Are Vector Datasets?
Ask Sage Datasets are vector databases that store your content as embeddings: numerical representations that capture the semantic meaning of your data. Unlike traditional databases that store raw files, vector databases enable:
Semantic Search
Find information based on meaning, not just keyword matching
Rapid Retrieval
Query large volumes of data efficiently
Context-Aware Responses
Generate answers that understand relationships between concepts
Important: Ask Sage datasets store embeddings, not the original files. This design optimizes for search and retrieval rather than file storage.
How Embeddings Work
1. Tokenization
Your text is broken into tokens (units of text ranging from single characters to whole words)
Example: "I love programming!" is typically split into four tokens: ["I", " love", " programming", "!"] (exact splits vary by tokenizer)
2. Embedding Generation
Each token is mapped to a numerical vector that captures its semantic meaning
These vectors represent relationships and context between words and concepts
3. Vector Storage
Embeddings are stored in the vector database, optimized for similarity search
This process consumes Training Tokens based on content volume
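The three steps above can be illustrated with a toy pipeline. The hash-based embed() below is only a stand-in for a real trained embedding model (it captures word overlap, not meaning); it exists to show the text → vector → store flow, and every name in it is illustrative:

```python
import hashlib
import math
import re

def embed(text: str, dims: int = 256) -> list[float]:
    """Toy embedding: hash each word into a bucket, count hits, L2-normalize.

    A real system uses a trained embedding model; this only sketches
    the shape of the pipeline.
    """
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# "Vector storage": keep (passage, embedding) pairs for later similarity search.
vector_store: list[tuple[str, list[float]]] = []

def ingest(passage: str) -> None:
    vector_store.append((passage, embed(passage)))

ingest("The warranty covers parts and labor for two years.")
ingest("Model X ships with a quick-start guide.")
```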
What Is RAG (Retrieval Augmented Generation)?
Retrieval Augmented Generation (RAG) is the core technology that makes Ask Sage Datasets powerful. RAG enhances AI responses by combining your ingested data with the model's capabilities through a two-step process:
Step 1: Retrieve Relevant Context
When you submit a prompt with a dataset selected:
1. Your query is converted into an embedding vector
2. The system searches the vector database for semantically similar content
3. The most relevant passages, facts, and information are retrieved
4. This context is ranked by relevance to your specific query
Step 2: Augment the Prompt with Retrieved Data
The retrieved context is integrated with your original prompt:
1. Your original prompt is combined with relevant dataset excerpts
2. This augmented prompt provides the AI model with specific, grounded information
3. The model generates a response based on both its training and your retrieved data
4. The result is a contextually accurate answer grounded in your sources
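The two RAG steps can be sketched end-to-end over a toy in-memory store. The hash-based embedding stands in for a real embedding model, the passages are invented, and in a real system the augmented prompt would be sent on to the generation model:

```python
import hashlib
import math
import re

def embed(text: str, dims: int = 256) -> list[float]:
    """Toy word-overlap embedding; a real system uses a trained model."""
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # inputs are unit vectors

store = [(p, embed(p)) for p in [
    "Model X warranty: two years, parts and labor.",
    "Model X setup: connect power, then pair the remote.",
    "Office hours are 9am to 5pm Eastern.",
]]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 1: rank stored passages by similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]

def augment(query: str) -> str:
    """Step 2: combine the retrieved context with the original prompt."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```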
Benefits of RAG with Vector Datasets
Up-to-date Information
Overcome model training cutoff dates by referencing your current data
Domain Expertise
Provide specialized knowledge not present in general AI training
Reduced Hallucinations
Ground responses in verifiable sources rather than model speculation
Transparency
Use explainability features to see which dataset content informed each response
Flexibility
Use the same dataset with different models without re-ingesting data
RAG vs. Traditional Approaches
No Dataset (Base Model)
How It Works: Model relies only on training data
Limitations: Outdated information, no organization-specific knowledge
Attachments Only
How It Works: Files processed per conversation
Limitations: No reusability, inefficient for recurring needs
RAG with Datasets (Recommended)
How It Works: Semantic search retrieves relevant context
Limitations: Requires initial ingestion, consumes training tokens
Practical RAG Example
Scenario
You've ingested your company's product documentation into a dataset called product-docs-2025.
Without RAG (No Dataset)
Prompt:
"What are the warranty terms for our Model X product?"
Response:
Generic information about typical warranties, possibly inaccurate or irrelevant to your specific product.
With RAG (Dataset Selected)
Prompt:
"What are the warranty terms for our Model X product?"
Process:
Vector search retrieves relevant warranty sections from your documentation
Response:
Specific warranty terms from your actual product documentation, with citations
Explainability:
Shows which document sections were referenced
This demonstrates how RAG transforms generic AI into a knowledgeable assistant grounded in your organization's specific information.
Technical Considerations
Vector Database Optimization
Best Use Cases
Vector databases excel at semantic search and retrieval
Ideal for unstructured text, documents, and narrative content
Perfect for finding conceptually similar information
Limitations
Not designed for large tabular data (spreadsheets)
For tabular data analysis, attach spreadsheets directly to prompts; Ask Sage will use Python libraries to analyze them
Token Efficiency Tips
Selecting "None" for datasets saves inference tokens when dataset context isn't needed
Only select relevant datasets to optimize token usage and response quality
Use the Show Explainability feature to verify which dataset content was used
Monitor token usage regularly in Settings → Tokens
CUI Compliance
Important: The Live feature is not CUI compliant and cannot be used with CUI-labeled datasets. Ensure proper classification when creating datasets containing sensitive information.